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of the system and weightings of the linear combination of components, the weight- 
ings having values based on data obtained from the at least one training sample, the 
at least one training sample having a known feature; obtaining a model of a prob- 
ability distribution of the known feature, wherein the model is conditional on the 
linear combination of components; obtaining a prior distribution for the weighting 
of the linear combination of the components, the prior distribution comprising a hy- 
perprior having a high probability density close to zero, the hyperprior being such 
that it is not a Jeffreys hyperprior; combining the prior distribution and the model 
to generate a posterior distribution; and identifying the subset of components based 
on a set of the weightings that maximise the posterior distribution. 



WO 2004/104856 Al lilDIOQBIIilllDIIIIDflllllinUIll 



TN, TR, TT, TZ, UA. UG, US, UZ, VC, VN, YU, ZA, ZM, 

zw. 



(84) Designated States (unless otherwise indicated, for every 
kind of regional protection available)'. ARIPO (BW, GH, 
GM, KB, LS, MW, MZ, NA, SD, SL, SZ, TZ, UG, ZM, 
ZW), Eurasian (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), 
European (AT, BE, BG, CH, CY, CZ, DB, DK, EE, ES, FI, 
FR, GB, GR, HU, IE, IT, LU, MC, NL, PL, PT, RO, SE, SI, 



SK, TO), OAPI (BF, BJ. CF, CG, CI, CM, GA, GN, GQ, 
GW, ML, MR, NE, SN, TD, TG). 

Published: 

— with international search report 

For two-letter codes and other abbreviations, refer to the "Guid- 
ance Notes on Codes and Abbreviations" appearing at the begin- 
ning of each regular issue of the PCT Gazette. 




10/552782 

tf> JC09Rec'dPCT/PT0 16 SEP 



WO 2004/104856 



PCT/AU2004/000696 



10 



15 



20 



25 



30 



- 1 - 

A METHOD FOR IDENTIFYING A SUBSET OF COMPONENTS OF A SYSTEM 
FIELD OF THE INVENTION 

The present invention relates to a method and apparatus for 
identifying components of a system from data generated from 
samples from the system, which components are capable of 
predicting a feature of the sample within the system and, 
particularly, but not exclusively, the present invention 
relates to a method and apparatus for identifying components 
of a biological system from data generated by a biological 
method, which components are capable of predicting a feature 
of interest associated with a sample applied to the 
biological system. 

BACKGROUND OF THE INVENTION 

There are any n umb er of systems in existence that can be 
classified according to one or more features thereof. The 
term **system" as used throughout this specification is 
considered to include all types of systems from which data 
(e.g. statistical data) can be obtained. Examples of such 
systems include chemical systems, financial systems and 
geological systems. It is desirable to be able to utilise 
data obtained from the systems to identify particular 
features of samples from the system; for instance, to assist 
with analysis of financial system to identify groups such as 
those who have good credit and those who are a credit risk. 
Often the data obtained from the systems is relatively large 
and therefore it is desirable to identify components of the 
systems from the data, the components being predictive of 
the particular features of the samples from the system. 
However, when the data is relatively large it can be 
difficult to identify the components because there is a 
large amount of data to process, the majority of which may 
not provide any indication or little indication of the 
features of a particular sample from which the data is 
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taken. Furthermore, components that are identified using a 
training sample are often ineffective at identifying 
features on test sample data when the test sample data has a 
high degree of variability relative to the training sample 
5 data- This is often the case in situations when, for 

example, data is obtained from many different sources, as it 
is often difficult to control the conditions under which the 
data is collected from each individual source. 



10 An example of a type of system where these problems are 

particularly pertinent, is a biological system, in which the 
components could include, for example, particular genes or 
proteins. Recent advances in biotechnology have resulted in 
the development of biological methods for large scale 

15 screening of systems and analysis of samples. Such methods 
include, for example, microarray analysis using DNA or RNA, 
proteomics analysis, proteomics electrophoresis gel 
analysis, and high throughput screening techniques. These 
types of methods often result in the generation of data that 

20 can have up to 3 0,000 or more components for each sample 
that is tested. 

It is highly desirable to be able identify features of 
interest in samples from biological systems. For example, 

25 to classify groups such as "diseased" and "non-diseased" . 
Many of these biological methods would be useful as 
diagnostic tools predicting features of a sample in the 
biological systems. For example, identifying diseases by 
screening tissues or body fluids, or as tools for 

30 determining, for example, the efficacy of pharmaceutical 
compounds . 

Use of biological methods such as biotechnology arrays in 
such applications to date has been limited due to the large 
35 amount of data that is generated from these types of 

methods, and the lack of efficient methods for screening the 
data for meaningful results. Consequently, analysis of 
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biological data using existing methods is time consuming, 
prone to false results and requires large amounts of 
computer memory if a meaningful result is to be obtained 
from the data. This is problematic in large scale screening 
5 scenarios where rapid and accurate screening is required. 

It is therefore desirable to have a method, in particular 
for analysis of biological data, and more generally, for an 
improved method of analysing data from a system in order to 
10 predict a feature of interest for a sample from the system. 

SUMMARY OF THE INVENTION 

According to a first aspect of the present invention, there 
15 is provided a method of identifying a subset of components 
of a system based on data obtained from the system using at 
least one training sample from the system, the method 
comprising the steps of: 

obtaining a linear combination of components of the 
20 system and weightings of the linear combination of 

components, the weightings having values based on the data 
obtained from the system using the at least one training 
sample, the at least one training sample having a known 
feature; 

25 obtaining a model of a probability distribution of the 

known feature, wherein the model is conditional on the 

linear combination of components; 

obtaining a prior distribution for the weighting of 

the linear combination of the components, the prior 
30 distribution comprising a hyperprior having a high 

probability density close to zero, the hyperprior being such 

that it is not a Jeffreys hyperprior; 

combining the prior distribution and the model to 

generate a posterior distribution; and 
35 identifying the subset of components based on a set of 

the weightings that maximise the posterior distribution. 
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The method utilises training samples having the known 
feature in order to identify the subset of components which 
can predict a feature for a training sample. Subsequently, 
knowledge of the subset of components can be used for tests, 
5 for example clinical tests, to predict a feature such as 
whether a tissue sample is malignant or benign, or what is 
the weight of a tumour, or provide an estimated time for 
survival of a patient having a particular condition. 

10 The term "feature" as used throughout this specification 
refers to any response or identifiable trait or character 
that is associated with a sample. For example, a feature 
may be a particular time to an event for a particular 
sample, or the size or quantity of a sample, or the class or 

15 group into which a sample can be classified. 

Preferably, the step of obtaining the linear combination 
comprises the step of using a Bayesian statistical method to 
estimate the weightings. 

20 

Preferably, the method further comprises the step of making 
am apriori assumption that a majority of the components are 
unlikely to be components that will form part of the subset 
of components. 

25 

The apriori assumption has particular application when there 
are a large amount of components obtained from the system. 
The apriori assumption is essentially that the majority of 
the weightings are likely to be zero. The model is 

30 constructed such that with the apriori assumption in mind, 
the weightings are such that the posterior probability of 
the weightings given the observed data is maximised. 
Components having a weighting below a pre -determined 
threshold (which will be the majority of them in accordance 

35 with the apriori assumption) are ignored. The process is 
iterated until the correct diagnostic components are 
identified. Thus, the method has the potential to be quick. 
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mainly because of the aprlori assumption, which results in 
rapid elimination of the majority of components. 

Preferably , the hyperprior comprises one or more adjustable 
5 parameter that enable the prior distribution near zero to be 
varied. 

Most features of a system, typically exhibit a probability 
distribution, and the probability distribution of a feature 

10 can be modelled using statistical models that are based on 
the data generated from the training samples. The present 
invention utilises statistical models that model the 
probability distribution for a feature of interest or a 
series of features of interest. Thus, for a feature of 

15 interest having a particular probability distribution, an 
appropriate model is defined that models that distribution. 

Preferably, the method comprise a mathematical equation in 
the form of a likelihood function that provides the 
20 probability distribution based on data obtained from the at 
least one training sample. 

Preferably, the likelihood function is based on a previously 
described model for describing some probability 
25 distribution. 

Preferably, the step of obtaining the model comprises the 
step of selecting the model from a group comprising a 
multinomial or binomial logistic regression, generalised 
30 iinear model, Cox's proportional hazards model, accelerated 
failure model and parametric survival model. 

In a first embodiment, the likelihood function is based on 
the multinomial or binomial logistic regression. The 
35 binomial or multinomial logistic regression preferably 
models a feature having a multinomial or binomial 
distribution. A binomial distribution is a statistical 
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distribution having two possible classes or groups such as 
an on/off state. Examples of such groups include 
dead/alive, improved/not improved, depressed/not depressed. 
A multinomial distribution is a generalisation of the 
5 binomial distribution in which a plurality of classes or 
groups are possible for each of a plurality of samples, or 
in other words, a sample may be classified into one of a 
plurality of classes or groups. Thus, by defining a 
likelihood function based on a multinomial or binomial 

10 logistic regression, it is possible to identify subsets of 

components that are capable of classifying a sample into one 
of a plurality of pre-defined groups or classes. To do 
this, training samples are grouped into a plurality of 
sample groups (or ^classes*) based on a predetermined 

15 feature of the training samples in which the members of each 
sample group have a common feature and are assigned a common 
group identifier* A likelihood function is formulated based 
on a multinomial or binomial logistic regression conditional 
on the linear combination (which incorporates the data 

20 generated from the grouped training samples) . The feature 
may be any desired classification by which the training 
samples are to be grouped. For example, the features for 
classifying tissue samples may be that the tissue is normal, 
malignant, benign, a leukemia cell, a healthy cell, that the 

25 training samples are obtained from the blood of patients 
having or not having a certain condition, or that the 
training samples are from a cell from one of several types 
of cancer as compared to a normal cell. 



30 In the first embodiment, the likelihood function based on 
the multinomial or binomial logistic regression is of the 
form: 



n 



f 

0-1 


- 




1 


* 


\ 




► A 


G-l _ 


J 



vvuu<nu4ttS 0 [nii p://wvw.g eanepaier)t.co^ ■ 



WO 2004/104856 PCT/AU2004/000696 

- 7 - 

wherein 

xf |5 g is a linear combination generated from input data from 
training sample i with component weights fi g t 
5 xf is the components for the 1 th Row of X and p g is a set of 
component weights for sample class gr; and 

X is data from n training samples comprising p components 
and the etk are defined further in this specif ication* 

10 In a second embodiment/ the likelihood function is based on 
the ordered categorical logistic regression • The ordered 
categorical logistic regression models a binomial or 
multinomial distribution in which the classes are in a 
particular order (ordered classes such as for example, 

15 classes of increasing or decreasing disease severity) . By- 
defining a likelihood function based on an ordered 
categorical logistic regression, it is possible to identify 
a subset of components that is capable of classifying a 
sample into a class wherein the class is one of a plurality 

20 of predefined ordered classes* By defining a series of 
group indentif iers in which each group identifier 
corresponds to a member of an ordered class, and grouping 
the training samples into one of the ordered classes based 
on predetermined features of the training samples, a 

25 likelihood function can be formulated based on a categorical 
ordered logistic regression which is conditional on the 
linear combination (which incorporates the data generated 
from the grouped training samples) . 

3 0 In the second embodiment, the likelihood function based on 
the categorical ordered logistic regression is of the form: 

Wherein 

y ik is the probability that training sample i belongs to a 
35 class with identifier less than or equal to Jfc (where the 
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total of ordered classes is G) .The r± is defined further in 
the document. 

In a third embodiment of the present invention, the 
5 likelihood function is based on the generalised linear 
model. The generalised linear model preferably models a 
feature that is distributed as a regular exponential family 
of distributions. Examples of regular exponential family of 
distributions include normal distribution, guassian 

10 distribution, poisson distribution, gamma distribution and 
inverse gaussian distribution. Thus, in another embodiment 
of the method of the invention, a subset of components is 
identified that is capable of predicting a predefined 
characteristic of a sample which has a distribution 

15 belonging to a regular exponential family of distributions. 
In particular by defining a generalised linear model which 
models the characteristic to be predicted. Examples of a 
characteristic that may be predicted using a generalised 
linear model include any quantity of a sample that e x hi b its 

20 the specified distribution such as, for example, the weight, 
size or other dimensions or quantities of a sample. 

In the third embodiment, the generalised linear model is of 
the form: 

L - log p(y | ft *) = ± {MzMl + c(y/ ^) } 

25 

where y = (yi,~, y n ) T and a. t (<f>) « <f> /w± with the w t being a 
fixed set of known weights and <f> a single scale parameter. 
The other terms in this expression are defined later in this 
document. 

30 

In a fourth embodiment, the method of the present invention 
may be used to predict the time to an event for a sample by 
utilising the likelihood function that is based on a hazard 
model, which preferably estimates the probability of a time 
35 to an event given that the event has not taken place at the 
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time of obtaining- the data. In the fourth embodiment, the 
likelihood function is selected from the group comprising a 
Cox's proportional hazards model, parametric survival model 
and accelerated failure times model. Cox's proportional 
5 hazards model permits the time to an event to be modelled on 
a set of components and component weights without making 
restrictive assumptions about time. The accelerated failure 
model is a general model for data consisting of survival 
times in which the component measurements are assumed to act 

10 multiplicatively on the time-scale, and so affect the rate 
at which an individual proceeds along the time axis. Thus, 
the accelerated survival model can be interpreted in terms 
of the speed of progression of, for example, disease. The 
parametric survival model is one in which the distribution 

15 function for the time to an event (eg survival time) is 
modelled by a known distribution or has a specified 
parametric formulation. Among the commonly used survival 
distributions are the Weibull, exponential and extreme value 
distributions . 

20 

In the fourth embodiment, a subset of components capable of 
predicting the time to an event for a sample is identified 
by defining a likelihood based on Cox' s proportional 
standards model, a parametric survival model or an 
25 accelerated survival times model, which comprises measuring 
the time elapsed for a plurality of samples from the time 
the sample is obtained to the time of the event. 

In the fourth embodiment, the likelihood function for 
30 predicting the time to an event is of the form: 

N 

Log (Partial ) Likelihood = ]T g t {fi,<P; X, y, cj 

where f} 1 ~\P\.p2. ->P p ) and <p l = [<P\,<P2>"'><Pq j are the * odel 
parameters, y is a vector of observed times and c is an 
indicator vector which indicates whether a time is a true 
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survival time or a censored survival time. 

In the fourth embodiment , the likelihood function based on 
Cox's proportional hazards model is of the form: 



'(£i£)- n 



exp 



exp 



where the observed times are be ordered in increasing 

magnitude denoted as £~('(/)''(2)'***'(tf))' t (i+l) > *(i) m and Z denotes 

the Nxp matrix that is the re -arrangement of the rows of X 
10 where the ordering of the rows of Z corresponds to the 

ordering induced by the ordering of t . Also fi T =(A«/*2'* # **Ai) ' 

Zj a the j th row of Z, and <Ry = {i:i = y t y + I,*-,A^}« the risk set 

at the j" 1 ordered event time /(y) • 



15 In the fourth embodiment/ wherein the likelihood function is 
based on the Parametric Survival model it is of the form* 



N 



log 



^(wf)J 



where /jf = /l(y,*;#Je^(Jfi/?) and A denotes the integrated 
20 parametric hazard function. 



For any defined models , the weightings are typically 
estimated using a Bayesian statistical model (Kotz and 
Johnson, 1983) in which a posterior distribution of the 
25 component weights is formulated which combines the 

likelihood function and a prior distribution. The component 
weightings are estimated by maximising the posterior 
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distribution of the weightings given the data generated for 
the at least one training sample. Thus, the objective 
function to be maximised consists of the likelihood function 
based on a model for the feature as discussed above and a 
5 prior distribution for the weightings. 

Preferably, the prior distribution is of the form: 

10 wherein v is a p x 1 vector of hyperparameters, and where 
p\fi \v 2 j is W(0,diag (v z jj and p[v 2 ) is some hyperprior 
distribution for v 2 . 

Preferably, the hyperprior comprises a gamma distribution 
15 with a specified shape and scale parameter* 

This hyperprior distribution (which is preferably the same 
for all embodiments of the method) may be expressed using 
different notational conventions, and in the detailed 
20 description of the embodiments (see below) , the following 
notational conventions are adopted merely for convenience 
for the particular embodiment: 

As used herein, when the likelihood function for the 
25 probability distribution is based on a multinomial or 

binomial logistic regression, the notation for the prior 
distribution is: 

x 1 8=1 



30 



where fi T = and t r 
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and p(p g \*l) is #(o,diag{r*}) and P(t\) is some hyperprior 
distribution for x\ . 

As used herein, when the likelihood function for the 
5 probability distribution is based on a categorical ordered 
logistic regression, the notation for the prior distribution 
is: 

10 where A»A»"'»A aro component weights, ^(Ak) is^(0,v/ jand 
P^V/j some hyperprior distribution for v*. 

As used herein, when the likelihood function for the 
distribution is based on a generalised linear model, the 
15 notation for the prior distribution is: 

P(P)~ \p{P \v 2 )p{v 2 )dv 2 

r 2 

wherein v is a p x 1 vector of hyperparameter s , and where 
p{/3 |v 2 ) is w(o,diag{v 2 }) and p{y 2 ) is some prior distribution 
20 for v*. 

As used herein, when the likelihood function for the 
distribution is based on a hazard model, the notation for 
the prior distribution is: 

25 

where p(/3*\t ) is w(o,diag{T 2 }) and p[r ) some hyperprior 
distribution for ? . 
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The prior distribution comprises a hyperprior that ensures 
that zero weightings are used whenever possible* 

t 

In an alternative embodiment/ the hyperprior is an inverse 
5 g annua distribution in which each /, =l/v f has an independent 
gamma distribution. 

In a further alternative embodi m e n t/ the hyperprior is a 
gamma distribution in which each vf , X t or rf (depending on the 
10 context) has an independent gamma distribution. 

As discussed previously , the prior distribution and the 
likelihood function are combined to generate a posterior 
distribution. The posterior distribution is preferably of 
15 the form: 

p(0<pv\y)a L{y\fiq>)p{fi\v)p(y) 
or 

20 

wherein Z|>>|/?,0>j is the likelihood function. 

Preferably/ the step of identifying the subset of components 
comprises the step of using an iterative procedure such that 
25 the probability density of the posterior distribution is 
maximised. 

During the iterative procedure, component weightings having 
a value less than a pre -determined threshold are eliminated, 
30 preferably by setting those component weights to zero. This 
results in the substantially elimination of the 
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corresponding component. 

Preferably, the iterative procedure is an EM algorithm. 

5 The EM algorithm produces a sequence of component weighting 
estimates that converge to give component the weightings 
that maximise the probability density of the posterior 
distribution. The EM algorithm consists of two steps, known 
as the E or Expectation step and the M, or Maximisation 

10 step. In the E step, the expected value of the log- 
posterior function conditional on the observed data is 
determined. In the M step, the expected log-posterior 
function is maximised to give updated component weight 
estimates that increase the posterior. The two steps are 

15 alternated until convergence of the E step and the M step is 
achieved, or in other words, until the expected value and 
the maximised value of the expected log-posterior function 
converges . 

2 0 It is envisaged that the method of the present invention may 
be applied to any system from which measurements can be 
obtained, and preferably systems from which very large 
amounts of data are generated. Examples of systems to which 
the method of the present invention may be applied include 

25 biological systems, chemical systems, agricultural systems, 
weather systems, financial systems including, for example, 
credit risk assessment systems, insurance systems, marketing 
systems or company record systems, electronic systems, 
physical systems, astrophysics systems and mechanical 

30 systems. For example, in a financial system, the samples 
may be particular stock and the components may be 
measurements made on any number of factors which may affect 
stock prices such as company profits, employee numbers, 
rainfall values in various cities, number of shareholders 

35 etc. 
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The method of the present invention is particularly suitable 
for use in analysis of biological systems* The method of 
the present invention may be used to identify subsets of 
components for classifying samples from any biological 
5 system which produces measurable values for the components 
and in which the components can be uniquely labelled. In 
other words, the components are labelled or organised in a 
manner which allows data from one component to be 
distinguished from data from another component. For 

10 example, the components may be spatially organised in, for 
example, an array which allows data from each component to 
be distinguished from another by spatial position, or each 
component may have some unique identification associated 
with it such as an identification signal or tag. For 

15 example, the con^>onents may be bound to individual carriers, 
each carrier having a detectable identification signature 
such as quantum dots (see for example, Rosenthal, 2001, 
Nature Biotech 19: 621-622; Han et al. (2001) Nature - 
Biotechnology 19$ 631-635), fluorescent markers (see for 

20 example, Fu et al, (1999) Nature Biotechnology 17s 1109- 

1111) , bar-coded tags (see for example, Lockhart and trulson 
(2001) Nature Biotechnology 19: 1122-1123). 

In a particularly preferred embodiment, the biological 
25 system is a biotechnology array. Examples of biotechnology 
arrays include oligonucleotide arrays, DNA arrays, DNA 
microarrays, RNA arrays, RNA microarrays, DNA microchips, 
RNA microchips, protein arrays, protein microchips, antibody 
arrays, chemical arrays, carbohydrate arrays, proteomics 
30 arrays, lipid arrays. In another embodiment, the biological 
system may be selected from the group including, for 
example, DNA or RNA electrophoresis gels, protein or 
proteomics electrophoresis gels, biomolecular interaction 
analysis such as Biacore analysis, amino acid analysis, 
35 ADMETox screening (see for example High- throughput ADMETox 

estimation: In Vitro and In Silico approaches (2002), Ferenc 
Darvas and Gyorgy Dorman (Eds), Biotechniques Press), 
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protein electrophoresis gels and proteomics electrophoresis 
gels* 

The components may be any measurable component of the 
system. In the case of a biological system, the components 
may be, for example, genes or portions thereof, DNA 
sequences, RNA sequences, peptides, proteins, carbohydrate 
molecules, lipids or mixtures thereof, physiological 
components, anatomical components , epidemiological 
components or chemical components . 

The training samples may be any data obtained from a system 
in which the feature of the sample is known. For example, 
training samples may be data generated from a sample applied 
to a biological system. For example, when the biological 
system is a DNA microarray, the training sample may be data 
obtained from the array following hybridisation of the array 
with RNA extracted from cells having a known feature, or 
cDNA synthesised from the RNA extracted from cells, or if 
the biological system is a proteomics electrophoresis gel, 
the training sample may be generated from a protein or cell 
extract applied to the system. 

It is envisaged that an embodiment of a method of the 
present invention may be used in re- evaluating or evaluating 
test data from subjects who have presented mixed results in 
response to a test treatment. Thus, there is a second 
aspect to the present invention. 

The second aspect provides a method for identifying a subset 
of components of a subject which are capable of classifying 
the subject into one of a plurality of predefined groups, 
wherein each group is defined by a response to a test 
treatment, the method comprising the steps of: 

exposing a plurality of subjects to the test treatment 
and grouping the subjects into response groups based on 
responses to the treatment; 
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measuring components of the subjects; and 
identifying a subset of components that is capable of 

classifying the subjects into response groups using a 

statistical analysis method. 

5 

Preferably, the statistical analysis method comprises the 
method according to the first aspect of the present 
invention* 

10 Once a subset of components has been identified, that subset 
can be used to classify subjects into groups such as those 
that are likely to respond to the test treatment and those 
that are not. In this manner, the method of the present 
invention permits treatments to be identified which may be 

15 effective for a fraction of the population, and permits 

identification of that fraction of the population that will 
be responsive to the test treatment* 

According to a third aspect of the present invention, there 
20 is provided an apparatus for identifying a subset of 

components of a subject, the subset being capable of being 
used to classify the subject into one of a plurality of 
predefined response groups wherein each response group # is 
formed by exposing a plurality of subjects to a test 
25 treatment and grouping the subjects into response groups 
based on the response to the treatment, the apparatus 
comprising: 

an input for receiving measured components of the 
subjects; and 

3 0 processing means operable to identify a subset of 

components that is capable of being used to classify the 
subjects into response groups using a statistical analysis 
method • 

35 Preferably, the statistical analysis method comprises the 
method according to the first or second aspect. 
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According to a fourth aspect of the present invention, there 
is provided a method for identifying a subset of components 
of a subject that is capable of classifying the subject as 
being responsive or non-responsive to treatment with a test 
5 compound, the method comprising the steps of: 

exposing a plurality of subjects to the test compound 
and grouping the subjects into response groups based on each 
subjects response to the test compound; 

measuring components of the subjects; and 
10 identifying a subset of components that is capable of 

being used to classify the subjects into response groups 
using a statistical analysis method. 

Preferably, the statistical analysis method comprises the 
15 method according to the first aspect. 

According to a fifth aspect of the present invention, there 
is provided an apparatus for identifying a subset of 
components of a subject, the subset being capable of being 

20 used to classify the subject into one of a plurality of 

predefined response groups wherein each response group is 
formed by exposing a plurality of subjects to a compound and 
grouping the subjects into response groups based on the 
response to the compound, the apparatus comprising: 

25 an input operable to receive measured components of 

the subjects; 

processing means operable to identify a subset of 
components that is capable of classifying the subjects into 
response groups using a statistical analysis method. 

30 

Preferably, the statistical analysis method comprises the 
method according to the first or second aspect of the 
present invention. 

35 The components that are measured in the second to fifth 

aspects of the invention may be, for example, genes or small 
nucleotide polymorphisms (SNPs) , proteins, antibodies, 
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carbohydrates, lipids or any other measurable component of 
the subject* 

In a particularly embodiment of the fifth aspect, the 
5 compound is a pharmaceutical compound or a composition 

comprising a pharmaceutical compound and a pharmaceutical^ 
acceptable carrier. 

The identification method of the present invention may be 
10 implemented by appropriate computer software and hardware. 

According to a sixth aspect of the present invention, there 
is provided an apparatus for identifying a subset of 
components of a system from data generated from the system 
15 from a plurality of samples from the system, the subset 

being capable of being used to predict a feature of a test 
sample, the apparatus comprising: 

a processing means operable to: 

obtain a linear combination of components of the system 
20 and obtain weightings of the linear combination of 

components, each of the weightings having a value based on 
data obtained from at least one training sample, the at 
least one training sample having a known feature; 

obtaining a model of a probability distribution of a 
25 second feature, wherein the model is conditional on the 
linear combination of components; 

obtaining a prior distribution for the weightings of 
the linear combination of the components, the prior 
distribution comprising an adjustable hyperprior which 
30 allows the prior probability mass close to zero to be varied 
wherein the hyperprior is not a Jeffrey's hyperprior; 

combining the prior distribution and the model to 
generate a posterior distribution; and 

identifying the subset of components having component 
35 weights that maximize the posterior distribution. 
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Preferably, the processing means comprises a computer 
arranged to execute software. 

According to a seventh aspect of the present invention, 
5 there is provided a computer program which, when executed by 
a computing apparatus, allows the computing apparatus to 
carry out the method according to the first aspect of the 
present invention. 

10 The computer program may implement any of the preferred 

algorithms and method steps of the first or second aspect of 
the present invention which are discussed above. 

According to an eighth aspect of the present invention, 
15 there is provided a computer readable medium comprising the 
computer program according with the seventh aspect of the 
present invention. 

According to a ninth aspect of the present invention, there 
20 is provided a method of testing a sample from a system to 

identify a feature of the sample, the method comprising the 
steps of testing for a subset of components that are 
diagnostic of the feature, the subset of components having 
been determined by using the method according to the first 
25 or second aspect of the present invention. 

Preferably, the system is a biological system. 

According to a tenth aspect of the present invention, there 
3 0 is provided an apparatus for testing a sample from a system 
to determine a feature of the sample, the apparatus 
comprising means for testing for components identified in 
accordance with the method of the first or second aspect of 
the present invention. 

3 5 

According to an eleventh aspect of the present invention, 
there is provided a computer program which, when executed by 
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on a computing device , allows the computing device to carry- 
out a method of identifying components from a system that 
are capable of being used to predict a feature of a test 
sample from the system, and wherein a linear combination of 
5 components and component weights is generated from data 
generated from a plurality of training samples, each 
training sample having a known feature, and a posterior 
distribution is generated by combining a prior distribution 
for the component weights comprising sin adjustable 
10 hyperprior which allows the probability mass close to zero 
to be varied wherein the hyperprior is not a Jeffrey' s 
hyperprior, and a model that is conditional on the linear 
combination, to estimate component weights which maximise 
the posterior distribution* 

15 

Where aspects of the present invention are implemented by 
way of a computing device, it will be appreciated that any 
appropriate computer hardware e.g. a PC or a mainframe or a 
networked computing infrastructure, may be used. 

20 

According to a twelfth aspect of the present invention, 
there is provided a method of identifying a subset of 
components of a biological system, the subset being capable 
of predicting a feature of a test sample from the biological 

25 system, the method comprising the steps of: 

obtaining a linear combination of components of the 
system and weightings of the linear combination of 
components, each of the weightings having a value based on 
data obtained from at least one training sample, the at 

30 least one training sample having a known first feature; 

obtaining a model of a probability distribution of a 
second feature, wherein the model is conditional on the 
linear combination of components; 

obtaining a prior distribution for the weightings of 

35 the linear combination of the components , the prior 

distribution comprising an adjustable hyperprior which 
allows the probability mass close to zero to be varied; 
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combining the prior distribution and the model to 
generate a posterior distribution; and 

identifying the subset of components based on the 
weightings that maximize the posterior distribution. 

BRIEF DESCRIPTION OP THE DRAWINGS 



Notwithstanding any other embodiments that zaay fall within 
the scope of the present invention, an embodiment of the 
10 present invention will now be described, by way of example 
only, with reference to the accompanying figures, in which: 

figure 1 provides a flow chart of a method according 
to the embodiment of the present invention; 

15 

figure 2 provides a flow chart of another method 
according to the embodiment of the present invention; 

figure 3 provides a block diagram of an apparatus 
2 0 according to the embodiment of the present invention; 

figure 4 provides a flow chart of a further method 
according to the embodiment of the present invention; 

25 figure 5 provides a flow chart of an additional method 

according to the embodiment of the present invention; and 

figure 6 provides a flow chart of yet another method 
according to the embodiment of the present invention. 

30 

DETAILED DESCRIPTION OP AN EMBODIMENT 



The embodiment of the present invention identifies a 
relatively small number of components which can be used to 
35 identify whether a particular training sample has a feature. 
The components are ^diagnostic" of that feature, or enable 
discrimination between samples having a different feature. 
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The number of components selected by the method can be 
controlled by the choice of parameters in the hyperprior • It 
is noted that the hyperprior is a gamma distribution with a 
specified shape and scale parameter. Essentially, from all 
5 the data which is generated from the system, the method of 
the present invention enables identification of a relatively 
small number of components which can be used to test for a 
particular feature. Once those components have been 
identified by this method, the components can be used in 
10 future to assess new samples. The method of the present 
invention utilises a statistical method to eliminate 
components that are not required t9 correctly predict the 
feature . 

15 The inventors have found that component weightings of a 

linear combination of components of data generated from the 
training samples can be estimated in such a way as to 
eliminate the components that are not required to correctly 
predict the feature of the training sample. The result is 

20 that a subset of components are identified which can 

correctly predict the feature of the training sample. The 
method of the present invention thus permits identification 
from a large amount of data a relatively small and 
controllable number of components which are capable of 

25 correctly predicting a feature. 

The method of the present invention also has the advantage 
that it requires usage of less computer memory than prior 
art methods. Accordingly , the method of the present 

3 0 invention cam be performed rapidly on computers such as, for 
example, laptop machines. By using less memory, the method 
of the present invention also allows the method to be 
performed more quickly than other methods which use joint 
(rather than marginal) information on components for 

35 analysis of, for example, biological data. 
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The method of the present invention al8o has the advantage 
that it uses joint rather than marginal information on 
components for analysis • 

5 A first embodiment relating to a multiclass logistic 
regression model will now be described. 

A. Multi Class Logistic regression model 

10 The method of this embodiment utilises the training samples 
in order to identify a subset of components which can 
classify the training samples into pre-defined groups. 
Subsequently, knowledge of the subset of components can be 
used for tests, for example clinical tests, to classify 

15 samples into groups such as disease classes. For example, a 
subset of components of a DNA microarray may be used to 
group clinical samples into clinically relevant classes such 
as, for example, healthy or diseased. 

20 In this way, the present invention identifies preferably a 
small and controllable number of components which can be 
used to identify whether a particular training sample 
belongs to a particular group. The selected components are 
^diagnostic" of that group, or enable discrimination between 

25 groups. Essentially, from all the data which is generated 

from the system, the method of the present invention enables 
identification of a small number of components which can be 
used to test for a particular group. Once those components 
have been identified by this method, the components can be 

3 0 used in future to classify new samples into the groups. The 
method of the present invention preferably utilises a 
statistical method to eliminate components that are not 
required to correctly identify the group the sample belongs 
to. 

35 

The samples are grouped into sample groups (or ^classes") 
based on a pre -determined classification. The classification 
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may be any desired classification by which the training 
s am p] es are to be grouped* For example, the classification 



may be whether the training samples are from a leukemia cell 
or a healthy cell/ or that the training samples are obtained 
5 from the blood of patients having or not having a certain 

condition, or that the training samples are from a cell from 
one of several types of cancer as compared to a normal cell. 

In one embodiment, the input data is organised into an 
10 rcxpdata matrix X = with n training samples and p 

components . Typically, p will be much greater than n • 

In another embodiment, data matrix X may be replaced by an n 
x n kernel matrix K to obtain smooth functions of X as 
15 predictors instead of linear predictors- An exaaqple of the 
kernel matrix K is k^exp ( -0 .5* (xi-Xj) t (Xi-Xj) /a 2 ) where the 
subscript on x refers to a row number in the matrix X. 
Ideally, subsets of the columns of K are selected which give 
sparse representations of these smooth functions- 

20 

Associated with each sample class (group) may be a class 
label y f , where y t =k,ke {l,...,G}, which indicates which of G 
sample classes a training sample belongs to. We write the 
nxl vector with elements y s as y. . Given the vector y we 
25 can define indicator variables 



In one embodiment, the component weights are estimated using 
a Bayesian statistical model (see Kotz and Johnson, 1983) . 

30 Preferably, the weights are estimated by maximising the 
posterior distribution of the weights given the data 
generated from each training sample- This results in an 
objective function to be maximised consisting of two parts - 
The first part a likelihood function and the second a prior 

35 distribution for the weights which ensures that zero weights 
are preferred whenever possible- In a preferred embodiment, 




otherwise 
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the likelihood function is derived from a multiclass 
logistic model. Preferably , the likelihood function ia 
computed from the probabilities: 




(A2) 



(A3) 



PiO-7 C?-l r a \ 



wherein 



10 Pi g is the probability that the training sample with input 
data Xi will be in sample class gr; 

jcfp g is a linear combination generated from input data from 
training sample 1 with component weights p B % 

xf is the components for the 1 th Row of X and p g is a set of 
15 component weights for sample class g; 

Typically, as discussed above, the component weights are 
estimated in a manner which takes into account the apriori 
assumption that most of the component weights are zero* 

20 

In one embodiment, components weights fig in equation (A2) 
are estimated in a manner whereby most of the values are 
zero, yet the samples can still be accurately classified. 

25 In one embodiment, the component weights are estimated by 
maximising the posterior distribution of the weights given 
the data in the Bayesian model referred to above'. 

Preferably, the component weights are estimated by 



30 
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(a) specifying a hierarchical prior for the component 
weights fi u ... 9 fi a _ x ; and 

(b) specifying a likelihood function for the input data; 

(c) determining the posterior distribution of the weights 
5 given the data using (A5) ; and 

(d) determining component weights which maximise the 
posterior distribution. 

In one embodiment, the hierarchical prior specified for the 
10 parameters P x ^^Pg^\ is of the form: 

r 2 g°\ 



where p' = (#', - ) . * = K P(W) is ^(0,diag(^ )) and 

15 P{*g) is a suitable prior. 

In one embodiment, p{?l) = T\.p{ T Vi where p{ t I) is a P rior 
wherein t^ 2 =l/r£ has an independent gamma distribution ♦ 



20 In another embodiment, is a prior wherein t£ has an 

independent gamma distribution. 

In one embodiment, the likelihood function is L\y\P u .„,P a „ x J of 
the form in equation (8) and the posterior distribution of 
25 fi and T given y is 

p(pt 1 \y) a L(y\fi)p(fi\t 2 )p(t 2 ) (A5) 



In one embodiment, the likelihood function has a first and 
30 second derivative* 

In one embodiment, the first derivative is determined from 
the following algorithm: 
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9log£ 



(A6) 



wherein ^ = (^ t i = l f .-.,Ji) p = !,-.•,») are vectors indicating 

5 membership of sample class gr and probability of class gr 
r espec t i ve ly ♦ 



10 



In one embodiment, the second derivative is determined from 
the following algorithm: 



(A7) 



where S hg is 1 if h equals g and zero otherwise. 

15 Equation A6 and equation A7 may be derived as follows: 

(a) Using equations (Al) , (A2) and (A3), the likelihood 
function of the data can be written as: 



20 



n 



Jf» 



1 + 



fe*f»\ 



(A8) 



(b) Taking logs of equation (A6) and using the fact that 

G 

^e ih =l for all i gives: 



25 



G-l 



( C -' r»« Y\ 
V 8*1 )) 



(A9) 



(c) Differentiating equation (A8) with respect to fig gives 
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whereby e* g ' =[e ig9 i , =(p^,i=l,wj are vectors indicating 
5 membership of sample class gr and probability of class sr 
respectively ♦ 

(d) The second derivative of equation (9) has elements 



10 



15 



^^ = ^diag{S hB p^p h p B }x (All) 



where 



h = g 
otherwise 



Component weights which maximise the posterior distribution 
of the likelihood function may be specified using an EM 
algorithm comprising an E step and an M step. 

20 In conducting the EM algorithm, the E step preferably 
comprises the step computing a term of the form; 

<7-l n 



*- r 1=1 (Alia) 

C7-1 n 

25 where = 2?{l/< I , ^(d^d^d^Y and ^ = 1/^ = 0 if 4 = 0. 

Preferably, equation (11a) is computed by calculating the 
conditional expected value of t ig 2 =l/r ig 2 when p\P ig \^) is 
N[0,T%} and p[t%) has a specified prior distribution. 
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Explicit formulae for the conditional expectation will be 
presented later* 



5 Typically , the EM algorithm comprises the steps: 



(a) performing an E step by calculating the 

conditional expected value of the posterior 
distribution of component weights using the 
10 function: 



where xf fi g = $ P g y g in equation (8) , d(y g )^P^d g , and 
15 is de£ined as in equation (11a) evaluated at 

P g —P g fg- Here P g is a matrix of zeroes and ones 
derived from the identity matrix such that P^P g 
selects non-zero elements of p g which are denoted 
by y z . 



20 



(b) performing an M step by applying an iterative 
procedure to maximise Q as a function of y 
whereby : 



25 = y'-cr f 



1 An \ 

(A13) 



where a' is a step length such that 0Scr f <l; and y m (y g , 
cf—l. / ~. * • , G-2.) . 

30 Equation (A12) may be derived as follows: 

Calculate the conditional expected value of (A5) given the 
observed data y and a set of parameter estimates ft . 
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Q = Q(fi\yj)<=E{togp(0,T\y)\yJ) 



10 



Consider the case when components of p (and p ) are set to 

zero i.e for g = l,..., G-l, fi g =P g y g and fi g = P g f g ' 

Ignoring terms not involving yand using (A4) , (A5) , (A9) we 
get: 



<2 = logI 



(A14) 



where xT f} g = x?P g f g in (A8), d g (y g )=P g d g where is defined 

15 as in equation (Alia) evaluated at fi s = p s f g * 

Note that the conditional expectation can be evaluated from 
first principles given (A4) ♦ Some explicit expressions are 
given later. 

20 

The iterative procedure may be derived as follows: 

To obtain the derivatives required in (11), first note that 

front (A8), (A9) and (A10) writing d(y) = {d g (f g ),g = 1, ~,G-l} , we 

get 



-</wg{rf(y)}"V 



(A15> 
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where 



= <fca£ [S ghPg - p g p„ } , 

s ={ 1 ' * =A 

** [0, otherwise 



+diag{d(f)}' 2 



(A16) 



10 and 



^=P/X T ,g = l,...G-l 



(A17) 



In a preferred embodiment, the iterative procedure may be 
15 simplified by using only the block diagonals of equation 
(A16) in equation (A13) . For g = \,...G-l, this gives* 

yf = y> +«' [x T ^ eg X g + diag{d g (f g )}- 2 }' 1 [x T g (e g -P g )-diag{d g (f g )}y' g } - 



20 



(A18) 



Rearranging equation (A18) leadB to 



(A19) 
25 where 



Yg r =diag{d g (y g )\Xl 
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Writing p(g) for the number of columns of Y g , (A19) requires 
the inversion of a p(g)xp(g) matrix which inay be quite 
large* This can be reduced to an nxn matrix for p(g)>n by 
5 noting that: 

{Y^„Y t +l)- l =I-r g {Ytf+^)- X Y g (A2Q) 



where Z g =A^I^. Preferably, (A19) is used when p(g)>n and 
10 (A19) with (A20) substituted into equation (A19) is used 
when p{g)<*n. 

Note that when r? has a Jeffreys prior we have: 

15 E{t»|jy-l/j§» 



In one embodiment, t^=l/r? has an independent gramma 
distribution with scale parameter b>0 and shape parameter 
k>0 so that the density of t£ is: 



Omitting subscripts to simplify the notation, it can be 
shown that 

25 

E{ t 2 | 0 } = (2k+l)/(2/b + 0 2 ) (A21) 



as follows : 



30 Define 

I(p,bJ0= J(t 2 ) p t exp(^).5/? 2 t 2 )')(t 2 > bjc)dt 2 
o 



then 
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I(p,b Jc^b^* 5 {rcp+k+o.s^^xi-fo.sb/s 2 )^ 0 ^ 



Proof 

5 Let s = p 2 ll then 

CO 

I(p f hjc>* pias J(t 2 /b)^ as expC-st 2 )^! 2 ^^! 2 
p 

Now using the substitution u = t 2 /b we get 

CO 

IfobJcH) 1 * 0 5 \(vLf° s exp(-sub>y(u,l,k)du 

0 

Now let s'=bs and substitute the expression for *y(u,l,k) . This 
10 gives 

CO 

UpAkH** 11 " J expKsM-lWu^^du / T(k) 

o 

Looking up a table of Laplace transforms, eg Abramowitz and 
Stegun, then gives the result. 

15 The conditional expectation follows from 

E{t 2 |/5}=I(l,b,k)/I(0,b,k) 
= (2k+l)/(2/b + j8 2 ) 



As k tends to zero and b tends to infinity we get the 
20 equivalent result using Jeffreys prior* For example, for 
k=0.005 and b=2*10 5 

E{t 2 |j3}=(L01)/(10' 5 +/S 2 ) 

Hence we can get arbitrarily close to the Jeffreys prior 
25 with this proper prior. 



The algorithm for this model has 
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•0.5 



d=E{t 2 | 0} 
=E{J-| fa** 

T 

where the expectation is calculated as above. 

In another embodiment, rf g has an independent gramma 
5 distribution with scale parameter b>0 and shape parameter 
k>0 . It can be shown that 



Ju k - 3/2I exp(-(Vu+ii))du 



b Ju k - ,/2l expH7 fg /u +u))du 

o 

2 1 K 3/24c (2 A fi~) 



(A22) 

b|0 i8 |K l/Uc (2^) 

1 (27^)K 3 ^(2^) 

l&J 2 ^(2^) 



10 where y is =fi^/2b and K denotes a modified Bessel function. 
For ksi in equation (A22) 

E{<|j8}=V2/b(l/|/3J) 

15 For K=0 . 5 in equation (A22) 

E{<|/9„>=V2A>(l/|i3J){K I (2 > /7, e )/K 0 (2 > /^)} 
or equivalent ly 

E{< (1/I0J 2 ){2 N / Yi K 1 (2^Y l8 ^(2^ )} 



Proof of (A.l) 

From the definition of the conditional expectation, writing 
25 y=0 2 /2b, we get 
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J r V l exp(- jt* )b" ! ( r 2 /b) w exp(r 2 /b)dT 2 
E{r 2 |/3}=^ 

0 

Rearranging, simplifying and making the substitution u=r 2 /b 
gives the first equation in (A22) . 

5 The integrals in (22) can be evaluated by using the result 

where K denotes a modified Bessel function, see 
Watson (1966) . 

10 

Examples of members of this class are Jed in which case 

E{T { ; 2 \ff ls }=^bcm,D 

15 which corresponds to the prior used in the Lasso technique, 
Tibshirani (1996) . See also Figueiredo(2001) . 

The case k=0.5 gives 

20 E{r£|0 lg }« V^(1/|^|){K^2^)/K 0 (2^)} 

or equivalently 

E{< |ft g }= (l/|/? ig | 2 ){2V^K 1 (2^)/K 0 (2^)} 

25 

where K 0 and K x are modified Bessel f mictions, see Abramowitz 
and Stegun(1970) . Polynomial approximations for evaluating 
these Bessel functions can be found in Abramowitz and 
Stegun(1970, p379) . The expressions above demonstrate the 
30 connection with the Lasso model and the Jeffreys prior 
model . 



WUU4iU4tK>b tnn p://vvww.geunepaient.cofii/i-ogiii,uu^/ot>auuv.tuuu i/rem»wu^ mtu^.^i t.^.i.wo^.o- , ~ 1 



WO 2004/104856 PCT/AU2004/000696 

- 37 - 



It will be appreciated by those skilled in the art that as k 
tends to zero and b tends to infinity the prior tends to a 
Jeffreys improper prior. 

5 

In one embodiment , the priors with 0< k Si and b>0 form a 
class of priors which might be interpreted as penalising non 
zero coefficients in a manner which is between the Lasso 
prior and the specification using Jeffreys hyper prior, 

10 

The hyperparameters b and k can be varied to control the 
n umb er of components selected by the method. As k tends to 
zero for fixed b the number of components selected can be 
decreased and conversely as k tends to 1 the number of 
15 selected components can be increased. 

In a preferred embodiment , the EM algorithm is performed as 
follows : 

20 1- Set n=0 , =/ and choose an initial value for y°.Choose a 
value for b and k in equation (A22) . For example b»le7 and 
k=*0 gives the Jeffreys prior model to a good degree of 
approximation. This is done by ridge regression of 
log(pi g /pio ) on xi where p ig is chosen to be near one for 

25 observations in group g and a small quantity >0 otherwise 

- subject to the constraint of all probabilities summing 
to one. 

2. Do the E step i.e evaluate £? = £2(r| yj") • Note that this 
30 also depends on the values of k and b. 

3. Set t=0. For g = l,...,G-l calculate: 

a) Sg^r'f-r'g using (A19) with (A20) substituted into 
(A19) when pfgj*>n. 
35 (b) Writing 8* =(^ ) ^ = 1,„. > 0-1) Do a line search to find the 

value of a' in y M = y* ^ofS 1 which maximises (or simply 
increases) (12) as a function of a* . 
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c) sot y' +l =y' and t=t+l 



Repeat steps (a) and (b) until convergence. 

5 This produces 7*" +l say which maximises the current Q function 
as a function of y . 



Where £<zlI , say 10~ 5 . Define P g so that P ig =0 for ze^and 

10 



This step eliminates variables with small coefficients from 
the model. 

15 

4. Set n=n+l and go to 2 until convergence. 

A second embodiment relating to an ordered cat logistic 
regression will now be described. 

20 

B. Ordered categories model 

The method of this embodiment may utilise the training 
samples in order to identify a subset of components which 

25 can be used to determine whether a test sample belongs to a 
particular class. Por example, to identify genes for 
assessing a tissue biopsy sample using microarray analysis, 
microarray data from a series of samples from tissue that 
has been previously ordered into classes of increasing or 

30 decreasing disease severity such as normal tissue, benign 

tissue, localised tumour and metastasised tumour tissue are 
used as training samples to identify a subset of components 
which is capable of indicating the severity of disease 
associated with the training samples. The subset of 

35 components can then be subsequently used to determine 
whether previously unclassified test samples can be 
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classified as normal, benign, localised tumour or 
xnetastasised tumour. Thus, the subset of components is 
diagnostic of whether a test sample belongs to a particular 
class within an ordered set of classes* It will be apparent 
5 that once the subset of components have been identified, 
only the subset of components need be tested in future 
diagnostic procedures to determine to what ordered class a 
sample belongs. 

10 The method of the invention is particularly suited for the 
analysis of very large amounts of data. Typically, large 
data sets obtained from test samples is highly variable and 
often differs significantly from that obtained from the 
training samples. The method of the present invention is 

15 able to identify subsets of components from a very large 
amount of data generated from training samples, and the 
subset of components identified by the method can then be 
used to classifying test samples even when the data 
generated from the test sample is highly variable compared 

20 to the data generated from training samples belonging to the 
same class. Thus, the method of the invention is able to 
identify a subset of components that are more likely to 
classify a sample correctly even when the data is of poor 
quality and/or there is high variability between samples of 

25 the same ordered class. 

The components are ^predictive" for that particular ordered 
class. Essentially, from all the data which is generated 
from the system, the method of the present invention enables. 

30 identification of a relatively small number of components 

which can be used to classify the training data. Once those 
components have been identified by this method, the 
components can be used in future to classify test samples. 
The method of the present invention preferably utilises a 

35 statistical method to eliminate components that are not 

required to correctly classify the sample into a class that 
is a member of an ordered class. 
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In the following there are N samples, and vectors such as y, 
z and ii have components yi, zi and Pi for i = l, N. Vector 
xnultiplication and division is defined componentwise and 
5 diag{ ' } denotes a diagonal matrix whose diagonals are 
equal to the argument. We also use | | * | | to denote 
Euclidean norm. 

Preferably, there are N observations yt where y\ takes 
10 integer values 1,...,G. The values denote classes which are 

ordered in some way such as for example severity of disease. 

Associated with each observation there is a set of 

covariates (variables , e.g gene expression values) arranged 

into a matrix X* with n row and p columns wherein n is the 
15 samples and p the components. The notation x t denotes the 

i th row of X*. Individual (sample) i has probabilities of 

belonging to class k given by /r tt »jt 4 (j£) . 

Define cumulative probabilities 

k 

Xa^Z^ ' k - 1 '~' G 

20 Note that is just the probability that observation i 

belongs to a class with index less than or equal to k. Let C 
be a N by p matrix with elements c & given by 



30 



* if observation i in class j 

Q .... 



-ii. i 
-(/ "~ \ 0, otherwise 

25 and let R be an n by P matrix with elements r 9 given by 



These are the cumulative sums of the columns of C within 
rows. 

For independent observations (samples) the likelihood of the 
data may be written as: 
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and the log likelihood L may be written ast 



5 



The continuation ratio model may be adopted here as follows: 



15 The likelihood is equivalent to a logistic regression 

likelihood with response vector jyand covariate matrix X 



where 7^, is the G-l by G-l identity matrix and \ G _ X is a Q-l 
20 by 1 vector of ones* 

Here vec{ } takes the matrix and forms a vector row by row. 

Typically, as discussed above, the component weights are 
estimated in a manner which takes into account the a priori 
25 assumption that most of the component weights are zero. 




for k a 2,~»,G , see McCullagh and Nelder(1989) and 
McCullagh(198 0) and the discussion therein. Note that 




y~vec{R) 



Following Flguel redo (2001) , in order to eliminate redundant 
variables (covariates) , a prior is specified for the 
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parameters fi* by introducing a p x 1 vector of 
hyperparameters . 

Preferably, the prior specified for the component weights is 
5 of the form 

#OJ*(*V)*H** (B1) 

where |v 2 ) is jv(o,diag{v 2 ]) and p(y 2 ) is a suitably chosen 

n 

hyperprior. For example, p(v 2 )&YL p ( v ?) ±B a suitable form of 
10 Jeffreys prior. 

In another embodiment, p(v?) is a prior wherein tf^llvf has 
an independent gamma distribution. 

15 In another embodiment, p{ y f) is a P^ioar wherein vf has an 
independent gamma distribution. 

The elements of theta have a non informative prior. 

20 Writing Z^j/ff**?) for the likelihood function, in a Bayesian 

framework the posterior distribution of /? , 0 and v given 
y is 

p(P<Pv\y) cc L(y\p*q>)p(py)p{v) (2) 

25 

By treating v as a vector of missing data, an iterative 
algorithm such as an EM algorithm (Dempster et al, 1977) can 
be used to maximise (2) to produce maximum a posteriori 
estimates of P and 8. The prior above is such that the 
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maximum a posteriori estimates will tend to be sparse i.e. 
if a large number of parameters are redundant, many 
components of p* will be zero. 

5 Preferably /i T =(0 T ,P* T ) in the following: 

For the ordered categories model above it can be shown that 

frx l (y -n) (ID 

dp 

E{0}=-X'diag{//(1-//)}X (12) 
op 

10 Where exp(x^)/(l + exp(x^))and fi r « (9 z ^9 C9 p* T ) - 

The iterative procedure for maximising the posterior 
distribution of the components and component weights is an 
EM algorithm, such as, for example, that described in 
15 Dempster et al, 1977. Preferably, the EM algorithm is 
performed as follows: 

1. Chose a hyperprior and values b and k for its parameters. 
Set n=0, S 0 » {1,2,..., p } , <|> (0) , and e =10- s (say). Set the 
20 regularisation parameter k at a value much greater than 1, 
say 100. This corresponds to adding 1/k 2 to the first G-l 
diagonal elements of the second derivative matrix in the M 
step below. 

25 If pi N compute initial values p* by 

0*KX l X+My ! X T g(y*-£) (B2) 

and if p > N compute initial values p* by 



30 



-X T (XX T +XI)" 1 X)X T g(yf £) (B3 ) 

where the ridge parameter X satisfies 0 < A. £ 1 and £ is 
small and chosen so that the link function g{z) = log(z/(l-z)) is 
well defined at z « y+£ . 
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2 . Define 



* 0, otherwise 



and let P n be a matrix of zeroes and ones such that the 
nonzero elements y (n> of p (n) satisfy 

Define = (w fii9 i -hp) i such that 

f 1. i ^ G 
* 1 0, otherwise 

and let w y =P n w fi 

15 

3 • Perform the E step by calculating 

Q(PIP <n) ,<P (n) )=E{IogpO,<p,v|y)| y,/3< n >,^ n >} 

- LCylA^-OJfllO^w^/awif ) 

where L is the log likelihood function of y and 

20 c? fe = £{1 / i£ | P^}-** . d g = (d 1g , d 2g , d pg f and for convenience we 

define d ig = Ud lg - 0 if fi ig = 0. . Using 0 = P„ Y and 0» = P,t» (15 ) 
can be written as 

Q(Y|Y W ,* W ) = Hy|P.7.^H>.5 ai(7*w r yd(V n >)|| 2 ) (B4) 

25 

with d(y ln) )=Pjd M evaluated at y ff<">=py ) . 



4. Do the M step. This can be done with Newton Raphson 
iterations as follows. Set y 0 = y w and for r=0,l,2,„. y r+i 
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= Yr + ot r 8 r where <x r is chosen by a line search algorithm to 

ensure Q(7i*i I V n) » $ (n) ) > QCY r I V B \ # (n) ) • 
For p <> N use 

Is 

6 r = A(d*(Y (n) ))[Y n T V r -'Y n+ ir(Y n T Zr -^y) (B5) 

5 where 

k 9 otherwise 



Y D T =A(d'(y (n) ))P n T X T 
10 V/'-diagOai-AO) 

z, =(y-fO 

and /t r = exp( XP n -y r )/(l+exp( XP 0 y r )) 
For p > N use 



15 



20 



w y 

8= A(cf (y< n) ))P -Y D T (Y 0 Y7+V r r YJ0C V ^^) (B6) 

with V r and z r defined as before. 

Let y* be the value of y r when some convergence criterion is 
satisfied e.g 

| | y r - y r+1 | | < e (for example 1(T 5 ) • 

25 5. Define p* - P n y* , S^={i^G: | ft | >max(|/3. ) } where e x is a 
small constant, say le-5 . Set n=n+l . 

6, Check convergence. If | | y* - y (n) | | < e 2 where e 2 is 
suitably small then stop, else go to step 2 above. 

3 0 Recovering the probabilities 

Once we have obtained estimates of the parameters p are 
obtained, calculate 
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A 
7ft 



for i=l,...,N and k = 2,„.,G. 



Preferably, to obtain the probabilities we use the recursion 



*iG = a iG 



\ a ik - 



\(\-a ik )n ik 



and the fact that the probabilities sum to one, for i = 
1,...,N. 



In one embodiment the covariate matrix X 
10 ,with rows xi T can be replaced by a matrix K with ij element 
kij and k ±j s k( x± - Xj ) for some kernel function K. This 
matrix can also be augmented with a vector of ones* Some 
example kernels are given in Table 1 below, see Evgeniou et 
al(1999) . 

15 



Kernel function 


Formula for K( x - y ) 


Gaussian radial basis function 


exp( - | x - y | | 2 / a) , a>0 


Inverse multiquadric 


(|| x - y || 2 + c 2 r 1/2 


multiquadric 


(1 x - y | | 2 + c 2 ) M 


Thin plate splines 


II * - y ll 2a+1 

|| x - y || 2 »ln<|| x - y ||) 


Multi layer perceptron 


tanh( x'y-9 ) , for suitable 0 


Ploynomial of degree d 


(1 + xy ) d 


B splines 


B 2 n+i(x - y) 


Trigonometric polynomials 


sin{( d +1/2 ) <x-y))/s±n<(x- 
y) /2) 



Table 1: Examples of kernel functions 
In Table 1 the last two kernels are preferably one 
dimensional i.e. for the case when X has only one column. 
2 0 Multivariate versions can be derived from products of these 
kernel functions. The definition of B 2n *i can be found in De 
Boor (1978). Use of a kernel function results in mean values 
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which are smooth (as opposed to transforms of linear) 
f xinc t ions of the covariates X. Such models may give a 
substantially better fit to the data* 

5 A third embodiment relating to generalised linear models 
will now be described. 

C. Generalised Linear Models 

10 The method of this embodiment utilises the training samples 
in order to identify a subset of components which can 
predict the characteristic of a sample. Subsequently , 
knowledge of the subset of components can be used for tests, 
for example clinical tests to predict unknown values of the 

15 characteristic of interest. For example, a subset of 

components of a DNA microarray may be used to predict a 
clinically relevant characteristic such as, for example, a 
blood glucose level, a white blood cell count, the size of a 
tumour, tumour growth rate or survival time. 

20 

In this way, the present invention identifies preferably a 
relatively small number of components which can be used to 
predict a characteristic for a particular sample. The 
selected components are "predictive" for that 

25 characteristic. By appropriate choice of hyperparameters in 
the hyper prior the algorithm can be made to select subsets 
of varying sizes. Essentially, from all the data which is 
generated from the system, the method of the present 
invention enables identification of a small number of 

30 components which can be used to predict a particular 

characteristic. Once those components have been identified, 
by this method, the components can be used in future to 
predict the characteristic for new samples. The method of 
the present invention preferably utilises a statistical 

35 method to eliminate components that are not required to 
correctly predict the characteristic for the sample. 
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The inventors have found that component weights of a linear 
combination of components of data generated from the 
training samples can be estimated in such a way as to 
eliminate the components that are not required to predict a 
5 characteristic for a training sample. The result is that a 
subset of components are identified which can correctly 
predict the characteristic for samples in the training set. 
The method of the present invention thus permits 
identification from a large amount of data a relatively 
10 small number of components which are capable of correctly 
predicting a characteristic for a training sample, for 
example, a quantity of Interest. 

The characteristic may be any characteristic of interest. In 

15 one embodiment, the characteristic is a quantity or measure. 
In another embodiment, they may be the index number of a 
group, where the samples are grouped into two sample groups 
(or "classes") based on a pre - determined classification. The 
classification may be any desired classification by which 

20 the training samples are to be grouped. For example, the 

classification may be whether the training samples are from 
a leukemia cell or a healthy cell, or that the training 
samples are obtained from the blood of patients having or 
not having a certain condition, or that the training samples 

25 are from a cell from one of several types of cancer as 
compared to a normal cell . In another embodiment the 
characteristic may be a censored survival time, indicating 
that particular patients have survived for at least a given 
number of days. In other embodiments the quantity may be any 

30 continuously variable characteristic of the sample which is 
capable of measurement, for example blood pressure. 

In one embodiment, the data may be a quantity y i , where 
i 6 . We write the nxl vector with elements y t as y. We 

35 define a p x 1 parameter vector P of component weights (many 
of which are expected to be zero) , and a q x 1 vector of 
parameters 9 (not expected to be zero) • Note that q could be 
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zero (i.e. the set of parameters not expected to be zero may 
be empty) • 

In one embodiment, the input data is organised into an 



components. Typically, p will be much greater than n . 

In another embodiment, data matrix X may be replaced by an n 
x n kernel matrix K to obtain smooth functions of X as 
10 predictors instead of linear predictors. An example of the 
kernel matrix K is k±j=exp(-0 .5* (Xi-xj) fc (Xi~Xj) /a 2 ) where the 
subscript on x refers to a row number in the matrix X. 
* Ideally, subsets of the columns of K are selected which give 
sparse representations of these smooth functions* 

15 

Typically, as discussed above, the component weights are 
estimated in a manner which takes into account the apriori 
assumption that most of the component weights are zero. 

20 In one embodiment, the prior specified for the component 
weights is of the forms 



5 »x/?data matrix X = \x~} with n test training samples and p 



P(fi)=jp(f}\v 2 )p(v 2 )dv 2 



(CI) 



25 




A suitable form of hyperprior is. 




In another embodiment, the hyperprior p\y 2 ) is such that 
30 each t^l/^has an independent gamma distribution. 



In another embodiment, the hyperprior J?(v 2 ) is such that 
each vf has an independent gamma distribution. 
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Preferably, an uninformative prior for 9 is specified. 

The likelihood function is defined from a model for the 
5 distribution of the data. Preferably, in general, the 

likelihood function is any suitable likelihood function. For 
example, the likelihood function L[y\/3p) may be, but not 
restricted to, of the form appropriate for a generalised 
linear model (GLM) , such as for example , that described by 
10 Nelder and Wedderburn (1972) . Preferably, in this case, the 
likelihood function is of the form: 

L = log p(y 1 0, *) »f {Mz*ffll + } (C2 ) 

a ( (<j>) 



where y » (yi,.~, y«x) T and a A (0) = 4> /wi with the w A being a 
15 fixed set of known weights and <f> a single scale parameter. 

Preferably, the likelihood function is specified as follows: 
We have 

E{y i } = b'(e 1 ) 

2 0 Var{y} = VXBfaW - 7? a s (0) (C3 ) 

Each observation has a set of covariates x.± and a linear 
predictor tji = xi T 0. The relationship between the mean of 
the i th observation and its linear predictor is given by the 
link function 771 = g{j*i) s g( b' (0±) ). The inverse of the 
25 link is denoted by h, i.e 
Mi = b' (6 ± ) « h(r?i) . 

In addition to the scale parameter, a generalised linear 
model may be specified by four components: 
30 • the likelihood or (scaled) deviance function, 

• the link function 

• the derivative of the link function 

• the variance function. 
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Some common examples of generalised linear models are given 
in the table below. 



Distribution 


Link function 
3<M> 


Derivative 
of link 
function 


Variance 
function 


Scale 
parame 
ter 


Gaussian 


M 


1 


1 


Yes 


Binomial 


log(M/U- A*)) 


l/( fid- ft)) 


A*d- 
A*)/n 


No 


Poisson 


log (fx) 


1/ A* 


M 


No 


Gamma 


1/ V 


-1/ At 2 


M 2 


Yes 


Inverse 
Gaussian 


1/ A* 2 


-2/ M 3 


A* 3 


Yes 



5 In another embodiment a quasi likelihood model is specified 
wherein only the link function and variance function are 
defined. In some instances , such specification results in 
the models in the table above. In other instances, no 
distribution is specified. 

10 

In one embodiment, the posterior distribution of fi 9 and v 
given y is estimated using: 

p(P<pv\y) a L(y\P<p)p(p\v)p(y) (C4) 

15 

wherein L[y\p<p} is the likelihood function. 

In one embodiment, v may be treated as a vector of missing 
data and an iterative procedure used to maximise equation 
20 (C4) to produce maximum a posteriori estimates of 0. The 

prior of equation (CI) is such that the maximum a posteriori 
estimates will tend to be sparse i.e. if a large number of 
parameters are redundant, many components of P will be zero. 

25 As stated above, the component weights which maximise the 

posterior distribution may be determined using an iterative 
procedure. Preferable, the iterative procedure for 
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maximising the posterior distribution of the components and 
component weights is an EM algorithm comprising an E step 
and an M step, such as, for example, that described in 
Dempster et al, 1977* 

5 

In conducting the EM algorithm, the E step preferably 
comprises the step of computing terms of the form 

P=^E{/Sf/v^|/8 i } 

w (C4a) 

-iff/* 

F-l 

where d { = E{l/vf \ j} { }~* s and for convenience we define 

10 <*, = !/<?, = 0 if 0 { -O. In the following we write d = d(/3) = (d l ,d 2y .„ 9 d n ) T . 
In a similar manner we define for example d(J3 M ) and 
d(y™)=P?d(P n y in) ) where fi™=py n) and P„ is obtained from the p 
by p identity matrix by omitting columns j for which /?j rt> ~0. 

15 Preferably, equation (C4a) is computed by calculating the 

conditional expected value of t?—Vvf when /?(A|v*j is 7V(o,v?) 

and p{v?) bas a specified prior distribution. Specific 
examples and formulae will be given later. 



20 In a general embodiment, appropriate for any suitable 

likelihood function, the EM algorithm comprises the steps: 

(a) Selecting a hyperprior and values for its 
parameters. Initialising the algorithm by 
setting n=0, S 0 ~ {l,2,~., p } , initialise <p (0> 

25 , P* and applying a value for e, such as for 

example e = 10' 5 ; 

(b) Defining 

0M= | tf> ieS « (C5) 
0, otherwise 

and let Pn be a matrix of zeroes and ones such 
30 that the nonzero elements y (n) of JJ (n) satisfy 
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(c) performing an estimation (E) step by calculating 

the conditional expected value of the posterior 
distribution of component weights using the 
function: 

Q(P I P w , <P (D) ) = E{ Iogp(p, <p,v|y)| y, j8», 0 (n) } {C6 } 

= L(y | ft 0< n) ) -0.5 /? r A(rf(/? (fl) ))- 2 /? 



10 where L is the log likelihood function of y» 

Using /S=P n 7and d(V n> ) as defined in (C4a) , (C6> 
can be written as 



15 



Q( 7 | y*, 0<">) = L(y|P n y, &*yO.Sy T Md(y in) )r 2 r <C7) 



(d) performing a maximisation (M) step by applying 

an iterative procedure to maximise Q as a 
function of y whereby y 0 = Y (n) and for r=0,l,2, 
y r+1 = y r + otr 5r and where ar is chosen by a 
20 line search algorithm to ensure 

Q(7r*i I V n> > ^ (B) ) > 0(7, 1 V^ 00 ), ^ 

5=A(d( Y < n >))[-A(d( Y (n> ))J^A^^ ( C 8) 
25 where: 

d(V n) )=P n T rf(Py ,i> )as in (C4a); and 

for /8 r = P„7 r . 

30 
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(e) Let y* be the value of yr when some convergence 

criterion is satisf ied, for example, | | yr - 
yr+l| | < 6 (for example 1(T 5 ); 

5 (f) Defining p* = P n y* , S n+1 -{i: |ft | >max(|fl j |*£ 1 )} 

where £i is a small constant, for example le-5. 

(g) Set n=n+l and choose <p (n+1) = <p (n) +K n ( q>* - <p (n> ) 

d 

where q>* satisfies — L(y | P n y\#) = 0 and K n is a 

dtp 

10 dancing factor such that 0< K n < 1; and 

(h) Check convergence . Xf | j y* - y (n> | | < e 2 where e 2 
is suitably small then stop, else go to step (b) 
above. 

15 

In another embodiment, t? =l/v/ has an independent gamma 
distribution with scale parameter b>0 and shape parameter 
k>0 so that the density of tf is 

2 0 ^tf ,b^)=b ^t?/b) k l exp(-t?/byr(k) 

It can be shown that 

E{t 2 |/3}= (2k+l)/(2/b + jff 2 ) 

as follows: 

Define 



25 



CO 

3 0 I(p,b,k)= J(t 2 ) p t exp(-0.5/3VyKt 2 ,b,k)dt J 



o 

then 



I(p,b^)H>'^ as {r(p+k-K).5)/r(k)}(l+0.5b/3 a )' <r ' k+as) 
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Proof 

Let s = J3 2 /2 then 

%>,b,k)=* pHU f(t 2 /b) FH) - 5 exp(.st 2 h(t 2 ,b4c)dt 2 

0 

5 Now using the substitution u = t 2 /b we get 

co 

Kp.b.k)^* 1 ^ {(u) 1 ^- 5 exp(-sub)-y(u,l,k)du 
o 

Now let s'=bs and substitute the expression for 7(11, 1, k) . This 
gives 

10 

00 

ICp^kH^' 5 J expKsM^u^^'du / T(k) 

0 

Looking up a table of Laplace transforms, eg Abramowitz and 
Stegun, then gives the result. 

15 The conditional expectation follows from 

E{t 2 |/3}=I(l,b^)/I(0,b,k) 
= (2kH)/(2/b + 0 2 ) 

As k tends to zero and b tends to infinity we get the 
20 equivalent result using Jeffreys prior. For example, for 
k=0.005 and b=2*10 5 

E{t 2 |/3} = (L01)/(10- 5 + /3 2 ) 

Hence we can get arbitrarily close to the algorithm with a 
25 Jeffreys hyperprior with this proper prior. 

In another embodiment, vf has an independent gamma 
distribution with scale parameter b>0 and shape parameter 
k>0 . It can be shown that 
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Ju^-'expG^/u+ufldu 
Ety-a^JL 

b Ju^^'expC-^/u +u))du 
0 

= (2J_I^(2£) . 

VblAlK^C^) 



1 (2^)^(2^) 

1 0, | 2 iw~ 4> 



where X,=/3, 2 /2b and K denotes a modified Bessel function, 
which can be shown as follows: 
For k=l in equation (c9) 



E{^|ft>=^b(l/|ftD 
For Ks0.5 in equation (C9) 
10 B{»,*|ft>- ^?b(l/|ft|){K I (2 > /^yK 0 (2 > /^)} 

or equivalently 

E{J^|A>«(l/|ftf ){2^K 1 (2V^)/K 0 (2^)} 

15 

Proof 

From the definition of the conditional expectation, writing 
A^$}l2h, we get 

20 EK 2 [ft} ^ _____ 



0 

Rearranging, simplifying and making the substitution u-yf/b 
gives A.l 

The integrals in A*l can be evaluated by using the result 



25 
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fix =^K t (2a) 



where K denotes a modified Beasel function, see 
Watson (1966) . 

5 Examples of members of this class are k=l in which case 

E{^ 2 |/3 i }=^b(l/|/? i |) 

which corresponds to the prior used in the Lasso technique, 
10 Tibshirani(1996) • See also Pigueiredo (2001) . 

The case k=0.5 gives 

E0V 2 |/3,}= ^b(l/|ft|){K,(2V^)/K 0 (2^)} 

15 

or equivalently 

E{f,-- 2 |/5 i }= (1« ){2^Tk i (2 > /^>K 0 (2^")} 

20 where K 0 and K x are modified Bessel functions, see Abramowitz 
and Stegun(1970) . Polynomial approximations for evaluating 
these Bessel functions can be found in Abramowitz and 
Stegun(1970, p379) . Details of the above calculations are 
given in the Appendix. 

25 

The expressions above demonstrate the connection with the 
Lasso model and the Jeffreys prior model. 

It will be appreciated by those skilled in the art that as k 
3 0 tends to zero and b tends to infinity the prior tends to a 
Jeffreys improper prior. 

In one embodiment, the priors with 0< k £1 and b>0 form a 
class of priors which might be interpreted as penalising non 
35 zero coefficients in a manner which is between the Lasso 

prior and the original specification using Jeffreys prior. 
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In smother embodiment , in the case of generalised linear 
models, step (d) in the maximisation step may be estimated 



3 2 L 9 2 L 
by replacing -r — with its expectation E{ — — } 



This is 



preferred when the model of the data is a generalised linear 
model. 



9 2 L 

For generalised linear models the expected value E{ — — } may 

d y r 

be calculated as follows. Beginning with 



10 



dp 77 dti, a, (0) 



(C10) 



where X is the N by p matrix with i th row x± T and 



15 



(Cll) 



we get 



E(|i,=-X'ACa,( 9 )z?cgWX 



20 Equations (C10) and (Cll) can be written as 



E{-^}=-X'V' , X 



where V=A(a i (<p)xf(| 5i -) 2 ) 



(C12) 



(C13) 



25 
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Preferably, for generalised linear models, the EM algorithm 
comprises the steps: 

(a) Choose a hyper prior and its parameters. 
Initialising the algorithm by setting n=0, S 0 = 
{l,2,~, p } , <p (0) / applying a value for e, such as 
for example e ■ 10~ s , and 

If p S N compute initial values p* by 

f =(X l X+M)- , X T g(y+^ ( C14 ) 

and if p > N compute initial values p* by 

£•=1(1 -X T (XX T +XQ* l X)X T g^H) (C15) 

where the ridge parameter X, satisfies 0 < X <> 1 and £ 
is small and chosen so that the link function is well 
defined at y+£ • 

(b) Defining 



10 



15 



20 



1 0, otherwise 



and let Pn be a matrix of zeroes and ones such 
that the nonzero elements y{n) of p(n) satisfy 



<c) performing an estimation (E) step by calculating 
25 the conditional expected value of the posterior 

distribution of component weights using the 
function : 

QO I P (n) , <P (n) ) = E{ logp(P, q> , v | y) | y, , j*} ^ } 

- Uy | /S, 0 (O) )-O.5 fi T A(d(0 M )y 2 fl 



30 
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where L is the log likelihood function of y. 
Using /3=P n Yand ff* = P n Y n) (C16) can be written 
as 



5 QCH y n) , 4> (a) ) - Uy i p d 7, * (n) >o.s / 'A{d(/*yr 2 r (ci? ) 

(d) performing a maximisation (M) step by applyingr an 
iterative procedure, for example a Newton Raphson 
iteration, to maximise Q as a function of y whereby 
10 Yo - Y <n) and for r»0,l,2,~ Yr*i * Yr + a r 8 r where a r 

is chosen by a line search algorithm to ensure 
QCU.I iP.**) > Q(7 r |V B) »* (D) ), and for 
p £ N use 



15 8,= A(d(r w ))[Y n T V; , Y n +I3- , (Y n T V r - , z r -^y) (CIS) 

where 



25 



Y n =A(d(y<">))P n T X 



20 V=A( ai ((p)T?(f!i) 2 ) 



dfi 

and the subscript r denotes that these quantities are 
evaluated at /i = h(XP„Y r ) . 
For p > N use 

5 =A(d(Y w ))[I ^(^Y^V.^YJCYJV; 1 ^--^) (C19) 
with V r and z r defined as before. 



(e) Let Y* be the value of Yr when some convergence 
criterion is satisfied e*g 

| | Yr - Yr*i| | < e (for example 10~ s ) . 
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(f) Define p* = P n y* , S n+1 ={i: | ft | >max(|j3 j | *e, ) } where 

61 is a small constant, say le-5. Set n=n+l and 
choose <p n+1 = <p n + iQx( <p* - <p n ) where satisfies 
— L(y|P n 7\0) = O and K n is a damping factor such that 

5 0< Kn 1. Note that in some cases the scale parameter 

is known or this equation can be solve explicitly to 
get an updating equation for q>. 

The above embodiments may be extended to incorporate quasi 
10 likelihood methods Wedderburn (1974) and McCullagh and 

Nelder (1983)). In such an embodiment, the same iterative 
procedure as detailed above will be appropriate, but with L 
replaced by a quasi likelihood as shown above and, for 
example. Table 8.1 in McCullagh and Nelder (1983). In one 
15 embodiment there is a modified updating method for the scale 
parameter <p. To define these models requires specification 
of the variance function t 2 , the link function g and the 

derivative of the link function — . Once these are defined 



20 



the above algorithm can be applied. 

In one embodiment for quasi likelihood models, step 5 of the 
above algorithm is modified so that the scale parameter is 
updated by calculating 

where Ji and x are evaluated at p* = P n y* • Preferably, this 
updating is performed when the number of parameters s in the 
model is less than N. A divisor of N -s can be used when s 
30 is much less than N. 
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In another embodiment , for both generalised linear models 
and Quasi likelihood models the covariate matrix X with rows 
xi T can be replaced by a matrix K with ij th element k±j and 
kij « K(xi-Xj) for some kernel function K . This matrix can 
5 also be augmented with a vector of ones. Some example 
kernels are given in Table 2 below, see Evgeniou et 
al<1999) . 



Kernel function 


Formula for 1C( x - y ) 


Gaussian radial basis 
function 


exp(-||x-y|| 2 /a) , 
a>0 i 


Inverse mul tiquadr ic 


(|| x - y || 2 + c 2 )- 1/2 


Multiquadric 


(| | x - y || 2 + c 2 ) * 


Thin plate splines 


ii x - y ir* 1 

|| x - y || 2a ln(|| x - y ||) 


Multi layer perceptron 


tanh( x"y-0 ) , for suitable 

e 


Ploynomial of degree d 


(1 + xy ) d 


B splines 


B 2n+ x(x - y) 


Trigonometric polynomials 


sin(( d +1/2 ) (x-y))/sin((x- 
y)/2) 



10 Table 2: Examples of kernel functions 

In Table 2 the last two kernels are one dimensional i.e. for 
the case when X has only one column. Multivariate versions 
can be derived from products of these kernel functions. The 

15 definition of B 2n +i can be found in De Boor (1978 ). Use of a 
kernel function in either a generalised linear model or a 
quasi likelihood model results in mean values which are 
smooth (as opposed to transforms of linear) functions of the 
covariates X. Such models may give a substantially better 

20 fit to the data. 
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A fourth embodiment relating to a proportional hazards model 
will now be described. 

5 D. Proportional Hazard Models 

The method of this embodiment may utilise training samples 
in order to identify a subset of components which are 
capable of affecting the probability that a defined event 

10 (eg death, recovery) will occur within a certain time 

period* Training samples are obtained from a system and the 
time measured from when the training sample is obtained to 
when the event has occurred. Using a statistical .method to 
associate the time to the event with the data obtained from 

15 a plurality of training samples, a subset of components may 
be identified that are capable of predicting the 
distribution of the time to the event. Subsequently, 
knowledge of the subset of components can be used for tests, 
for example clinical tests to predict for example, 

20 statistical features of the time to death or time to relapse 
of a disease. For example, the data from a subset of 
components of a system may be obtained from a DNA 
microarray. This data may be used to predict a clinically 
relevant event such as, for example, expected or median 

25 patient survival times, or to predict onset of certain 
symptoms, or relapse of a disease. 

In this way, the present invention identifies preferably a 
relatively small number of components which can be used to 

30 predict the distribution of the time to an event of a 

system. The selected components are "predictive" for that 
time to an event. Essentially, from all the data which is 
generated from the system, the method of the present 
invention enables identification of a small number of 

35 components which can be used to predict time to an event. 
Once those components have been identified by this method, 
the components can be used in future to predict statistical 
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features of the time to an event of a system from new 
samples. The method of the present invention preferably 
utilises a statistical method to eliminate components that 
are not required to correctly predict the time to an event 
5 of a system. By appropriate selection of the hyperparameters 
in the model some control over the size of the selected 
subset can be achieved. 

As used herein, "time to an event" refers to a measure of 
10 the time from obtaining the sample to which the method of 

the invention is applied to the time of an event. An event 
may be any observable event. When the system is a 
biological system, the event may be, for example, time till 
failure of a system, time till death, onset of a particular 
15 symptom or symptoms, onset or relapse of a condition or 
disease, change in phenotype or genotype, change in 
biochemistry, change in morphology of an organism or tissue, 
change in behaviour. 

20 The samples are associated with a particular time to am 
event from previous times to an event. The times to an 
event may be times determined from data obtained from, for 
example, patients in which the time from sampling to death 
is known, or in other words, "genuine" survival times, and 

25 patients in which the only information is that the patients 
were alive when samples were last obtained, or in other 
words, "censored" survival times indicating that the 
particular patient has survived for at least a given number 
of days. 

30 

In one embodiment, the input data is organised into an 
nx/>data matrix X = {x {j ) with n test training samples and p 
components. Typically, p will be much greater than n • 

35 For example, consider an Nxp data matrix A r = ^x (/ J from, for 
example, a microarray experiment, with N individuals (or 
samples) and the same p genes for each individual. 
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Preferably, there is associated with each individual 
t (t = l>2r",N) a variable^, (j>,-£0 ) denoting the time to an 
event, for example, survival time. For each individual 
there may also be defined a variable that indicates whether 
5 that individual's survival time is a genuine survival time 
or a censored survival time. Denote the censor indicators 
as Cf where 

{/, if y t is uncensored 
0, if yi is censored 

10 

The JVxl vector with survival times y, may be written as y 
and the iVxl vector with censor indicators c^as c. 

Typically, as discussed above, the component weights are 
15 estimated in a manner which takes into account the a priori 
assumption that most of the component weights are zero. 

Preferably, the prior specified for the component weights is 
of the form 

20 

p(A,A.-.A)- jfi*(flk ) p (*> ) dx < D1 > 

r W 

where A»A»"*»A are component weights, P(/?,|t,) is N\0 9 rf J and 
P^J is some hyperprior distribution 

25 

*<*)-em*) 

that is not a Jeffrey's hyperprior. 

In one embodiment, the prior distribution is an inverse 
30 gamma prior for X in which tf=l/r / 2 has an independent gamma 
distribution with scale parameter b>0 and shape parameter 
k>0 so that the density of tf is 
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-Ktf,b,k)=b- , (t^) k - , exp(-t?/byr(k) . 



It can be shown that: 



E{t 2 |0}= (2k+l)/(2/b+0 2 ) 



(A) 



Equation A can be shown as follows: 
Define 

1 0 I(p,b,k)= ](t 2 y t expC-O^t 2 >y(t 2 ,b,k)dt 2 

0 

then 
15 Proof 



Let s = /3 2 /2 then 



20 



25 



I(p,b,k)=b^ J(t 2 /b)^ exp(-st 2 )T(t 2 ,b,k)dt 2 
o 

Now using the substitution u = t 2 /b we get 

CO 

IfobJ^b 1 ** 5 J(u)^ exp(-subh(u,l,k)du 

0 

Now let s'=bs and substitute the expression for ^>(u,l,k) . This 
gives 

Ifob^H^ 5 J expHs'+l^u^^du/rCk) 
o 

Looking up a table of Laplace transforms, eg Abramowitz and 
Stegun, then gives the result. 



30 



The conditional expectation follows from 

E{t 2 |/3} =1(1^/1(0^) 
= (2k+l)/(2/b + 0 2 ) 
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As k tends to zero and b tends to infinity, a result 
equivalent to using Jeffreys prior is obtained. For example, 
for k=0.005 and b=*2*10 5 

5 

E{t 2 |/3}=(1.01)/(10* 5 +/3 2 ) 

Hence we can get arbitrarily close to a Jeffery's prior with 
this proper prior. 

10 The modified algorithm for this model has 

b (n) =E{t 2 | ^ n> }"° 5 

where the expectation is calculated as above. 

15 

In yet another embodiment, the prior distribution is a gamma 
distribution for T? . Preferably, the gamma distribution has 
scale parameter b>0 and shape parameter k>0. 

20 Xt can be shown, that 

Ju k - 3/2 - , exp«7 j /u+u))du 
b Ju l£ - ,/2l expW7i/u-Ki))du 

0 

= (2 1 1^(2^) 
Vb 1/3,1^(2^) 

1 (2^)K 3 ^(2^) 

I ft I 2 K 1/2 . k (2^) 

where y^fflb and K denotes a modified Bessel function. 
25 Some special members of this class are k=l in which case 



E{r i - 2 }0 i }=V27b(l/!/3 i |) 
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which corresponds to the prior used in the Lasso technique, 
Tibshirani(1996) . See also Figueiredo (2001) . 

The case k=0»5 gives 

5 

B{r f 4 |A>- V^(l/li8J){K I (2^)/K 0 (2V^)} 
or equivalently 

10 E{r: 2 |jS^(l/|/^^^ 

where K 0 and K x are modified Bessel functions, see Abraxnowitz 
and Stegun(197 0) • Polynomial approximations for evaluating 
these Bessel functions can be found in Abramowitz and 
15 Stegun(1970, p379) . 

The expressions above demonstrate the connection with the 
Lasso model and the Jeffreys prior model. 

20 Details of the above calculations are as follows: 



25 



For the gamma prior above and y^fif/Tb 



Ju^expKy/u+u^du 



b Ju k, ^ , exp(-(7,/u+u))du 



= (2 1 K^(2^) {D2) 
V b | ft | K 1/24c (2^T) 

1 (2^)^(2^) 
Iftf K 1/24t (2^) 



where K denotes a modified Bessel function, 
For k=l in (D2) 



E{r: 2 ift}= V5/b(l/|ft|) 
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For K=0.5 in <D2) 

E{t: 2 |^}= ,/^(l/|0j|){K 1 (2^)/K o <2^)} 

5 

or equivalent ly 

E{r,- 2 = (1/IAf ){2^K 1 (2 x /^)/K 0 (2 > /^)} 

10 Proof 

From the definition of the conditional expectation, writing 
7,=0 2 /2b, we get 

]rr 2 r/ , exp(-r 1 .rr 2 )b- , (r i - 2 /b) k - , exp(r | - 2 /b)dr i 2 
E{T .*,0. }=! JL_ 

jT i - , e X p(-r^)b- , (r j - 2 /b) k, exp(r j * 2 /b)dr i 2 

0 

15 Rearranging, simplifying and making the substitution u-yf/b 
gives A.l 

The integrals in A.l can be evaluated by using the result 

20 J*-*- exp|-(r + ^-}Jfc ^K„(2a) 

where K denotes a modified Bessel function, see 
Watson(1966) • 

Xn a particularly preferred embodiment, p{r^a\jxf is a 
25 Jeffreys prior, Kotz and Johnson (1983) • 

The likelihood function defines a model which fits the data 
based on the distribution of the data* Preferably, the 
likelihood function is of the form: 

30 

N 

Log (Partial) Likelihood = ^g i {fi,p;X,y,c) 

/=1 
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where J3 T = {0\>fi2>~'fip) and ^ =(w»ff2'*"'Wy) are the n^del 
parameters. The model defined by the likelihood function 
may be any model for predicting the time to an event of a 
system. 

5 

Xn one embodiment, the model defined by the likelihood is 
Cox's proportional hazards model. Cox's proportional hazards 
model was introduced by Cox (1972) and may preferably be 
used as a regression model for survival data. In Cox's 

10 proportional hazards model, fi l is a vector of (explanatory) 
parameters associated with the components. Preferably, the 
method of the present invention provides for the 
parsimonious selection (and estimation) from the parameters 
p 1 ~{P\,fh>*">Pp) £or Cox's proportional hazards model given 

15 the data X w y and c. 

Application of Cox's proportional hazards model can be 
problematic in the circumstance where different data is 
obtained from a system for the same survival times, or in 

20 other words, tied survival times. Tied survival times may 
be subjected to a pre-processing step that leads to unique 
survival times. The pre-processing proposed simplifies the 
ensuing code as it avoids concerns about tied survival times 
in the subsequent application of Cox's proportional hazards 

2 5 model. 

The pre-processing of the survival times applies by adding 
an extremely small amount of insignificant random noise. 
Preferably, the procedure is to take sets of tied times and 

30 add to each tied time within a set of tied times a random 
amount that is drawn from a normal distribution that has 
zero mean and variance proportional to the smallest non-zero 
distance between sorted survival times. Such pre-processing 
achieves an elimination of tied times without imposing a 

35 draconian perturbation of the survival times. 
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The pre-processing generates distinct survival times. 
Preferably, these times may be ordered in increasing 

magnitude denoted as £ = ('(7)''(2)'* 'fyv))' > '(0 * 

Denote by Z the Nxp matrix that is the re -arrangement 
of the rows of X where the ordering of the rows of 
Z corresponds to the ordering induced by the ordering of £; 
also denote by Zj the 7 th row of the matrix Z . Let d be the 
result of ordering c with the same permutation required to 
order * • 

After pre-processing for tied survival times is taken 
into account and reference is made to standard texts on 
survival data analysis (eg Cox and Oakes, 1984), the 
likelihood function for the proportional hazards model may 
preferably be written as 



20 



'(*!£) -ft 



exp 



\ d J 



X exp(Zi/3) 



(D3) 



where fi T ~,P n ) > z i = the J*" 1 row of z > a*" 1 

Sflj = {i :i = y,y+l, ",jV} = the risk set at the j tb ordered event 

time tfj} . 

The logarithm of the likelihood (ie L — log(l)) may 
preferably be written as 



N 



i=l 



ZiP-log 



N 



N 



U= y J) 



(D4) 



where 



0, if j<i 
[J, it j*i 
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Notice that the model is non- parametric in that the 
parametric form of the survival distribution is not 
specified - preferably only the ordinal property of the 
5 survival times are used (in the determination of the risk 
sets) . As this is a non-parametric case <p La not required 
(ie <gr=0) . 

In another embodiment of the method of the invention, the 
10 model defined by the likelihood function is a Parametric 
survival model. Preferably, in a parametric survival model , 
J3 J is a vector of (explanatory) parameters associated with 
the components, and <p* is a vector of parameters associated 
with the functional form of the survival density function. 

15 

Preferably, the method of the invention provides 
for the parsimonious selection (and estimation) from the 
parameters /}* and the estimation of <p l =[<Pi.p2'" m *Vq) for 
parametric survival models given the data X , y and c . 

20 

In applying a parametric survival model, the survival times 
do not require pre-processing and are denoted as 
The parametic survival model is applied as follows* 
Denote by f{y;0>ft 9 X}tha parametric density function of the 

25 survival time, denote its survival function by 

00 

S{y;q>,J$,x}*s ^f{u;<p,p t Xytu where <p are the parameters relevant 
y 

to the parametric form of the density function and fi 9 X are 
as defined above. The hazard function is defined as 

h(y i ;9.fi.x) = f(y i ;9.^ . 

30 

Preferably, the generic formulation of the log- likelihood 
function, taking censored data into account, is 
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Reference to standard texts on analysis of survival time 
data via parametric regression survival models reveals a 
collection of survival time distributions that may be used. 
Survival distributions that may be used include , for 
example, the Weibullr Exponential or Extreme Value 
distributions . 

If the hazard function may be written as 

= then s( y/ ;^^A-) = e^(-yl(^; ? )^J and 

f{yi;<p>P.x) = *(yt;?)ew^ where A(yt;<p)= 

is the integrated hazard function and Xyy^q))- * ; x± is 

v ~' dy i 

the X th row of X. 



The Weibull, Exponential and Extreme Value distributions 
have density and hazard functions that may be written in the 
form of those presented in the paragraph immediately 
above* 

The application detailed relies in part on an algorithm of 
Aitken and Clayton (1980) however it permits the user to 
specify any parametric underlying hazard function. 



Following from Aitkin and Clayton (1980) a preferred 
likelihood function which models a parametic survival model 
is: 



1=1 



f f Y\ 



(D5) 



where = A^y^tpjexp^X^J ♦ Aitkin and Clayton (1980) note 
that a consequence of equation (11) is that the ^'s may be 
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treatedas Poisson variates with means //,-and that the last 

term in equation (11) does not depend on /} (although it 
depends on <p ) . 

5 Preferably, the posterior distribution of fi , q> and r given 
y is 



10 



wherein I[y\P>f> 9 cJ is the likelihood function. 



In one embodiment , v may be treated as a vector of missing 
data and an iterative procedure used to maximise equation 
(D6) to produce a posteriori estimates of /? . The prior of 
equation (Dl) is such that the Tnavi' mi Tn a posteriori 
15 estimates will tend to be sparse i.e. if a large number of 
parameters are redundant, many components of ft will be 
zero. 

Because a prior expectation exists that many components of 
20 JS M are zero, the estimation may be performed in such a way 
that most of the estimated #'s are zero and the remaining 
non-zero estimates provide an adequate explanation of the 
survival times. 

25 In the context of microarray data this exercise translates 
to identifying a parsimonious set of genes that provide an 
adequate explanation for the event times. 

As stated above, the component weights which maximise the 
30 posterior distribution may be determined using an iterative 
procedure. Preferable, the iterative procedure for 
maximising the posterior distribution of the components and 
component weights is an EM algorithm, such as, for example, 
that described in Dempster et al, 1977. 

35 



WO 2004/104856 



PCT/AU2004/000696 



- 75 - 



If the E step of the EM algorithm is examined, from (D6) 



ignoring terms not involving beta, it is necessary to 



compute 



i-i 



(D7) 



=I>?/d? 



n 



5 where d % = = i?{l/vf | fi i }~ 0 ' 5 and for convenience we define 

</, = l/rf, = 0ifj§, = 0. In the following we write d = d(J3)=(d l J 2 ,...J n ) T • 
In a similar manner we define for example d(J3 {n) ) and 
d(y in) )=Pjd(Py tt) ) where fP*=P n y w and P n is obtained from the p 
by p identity matrix by omitting columns j for which JS^-O. 

10 

Hence to do the E step requires the calculation of the 
conditional expected value of t^l/r, 2 when />(/?, |r, 2 ) ±s N{0 9 rf) 

and p(r*)has a specified prior distribution as discussed 
above • 

15 

In one embodiment, the EM algorithm comprises the steps : 

1. Choose the hyperprior and values for its parameters , 
namely b and k. Initialising the algorithm by setting n=0, 



20 So - {1,2,„, p } . initialise fi^ = fi* , <p^°K 



2. Defining 



' 1 0, 




25 



and let P n be a matrix of zeroes and ones such that the 
nonzero elements y^ n} of flS") satisfy 



<D8) 
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3 . Performing an estimation step by calculating the expected 
value of the posterior distribution of component weights. 
This may be performed using the function: 

5 

QiP\P (n \<P (n) ) = ^{log(^(A ? ,r|y))|^^ (n) ,^ {,,) } 

(D9) 



where L is the log likelihood function of y ♦ Using 
/? = P n y and /S (n) = P R y (n) we have 

io Q(y\r { *\<P {n) ) =i(£l^? w )-^r T A(rf(/ (w >))y (dio) 

4. Performing the maximisation step. This may be 

performed using Newton Raphson iterations as follows: 

Set = and for r= 0,1,2,... y r +\ = y r + a r S r where ar r is 

chosen by a line search algorithm to ensure Q(y r+l \ > 

is Q(r_ r I r ( V°) ' and 

5 r =A(d(yW))[-A(d(y<->))^A(d(y^))+/] , (^-^^) 

" h °" f = *"f ' ^ <D11) 

Let be the value of y r when some convergence 
criterion is satisfied e.g \\y r - Y r +\\\ < £ (for example £ = 1(T 5 ) 

20 5. Define J3* -P n y* , S n = j*.i fit \ > £{ max\pj || where S\ is a small 
constant, say 10~ 5 . Set 11=11+1, choose = p( n ) + K n ^<p* — j»W j 
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20 



dL(y\P n r\<p) 

where <p satisfies — — = 0 and K n is a damping factor 

d<p 

such that 0 < K n < 1 . 

6. Check convergence. If \\y* — y*"' \\ <e 2 where e 2 is suitably 
small then stop, else go to step 2 above. 

In another embodiment, step (Dll) in the maximisation step 
may be estimated by replacing — — with its expectation 

d 2 % 



10 In one embodiment, the EM algorithm is applied to maximise 
the posterior distribution when the model is Cox's 
proportional hazard' s model • 

To aid in the exposition of the application of the EM 
15 algorithm when the model is Cox's proportional hazards 
model, it is preferred to define ^dynamic weights" and 
matrices based on these weights. The weights are - 

w u=-& ' 

* N 



Matrices based on these weights are 
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W = 



*2 



w Nj 



f * 



0] 



In terms of the matrices of weights the first and 
second derivatives of L may be written as 



dL t ~ 

dp 



dp 2 



= Z T {w** -^*))z = Z T KZ 



(D12) 



where K.=W -A\W J. Note therefore from the transformation 
matrix P„ described as part of Step (2) of the EM algorithm 
(Equation D8) (see also Equations (Dll) ) it follows that 



10 



dy r 3/3, 



(D13) 
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Preferably , when the model is Cox's proportional hazards 
model the E step and M step of the EM algorithm are as 
follows : 

1. Choose the hyperprior and its parameters b and k. Set 
nsO, S 0 - {1,2, • • • , p}. Let v be the vector with 
components 



y i U , if c,=0 



10 for some small e r say .001. Define f to be log(v/t) 

If p £ N compute initial values p* by 

f=(Z T Z + Xr> A Z T f 
15 Xf p > N compute initial values /Thy 

-Z T {ZZ T +XI) l Z)Z T f 
where the ridge parameter X satisfies 0 < X < 1. 
20 2. Define 

' 0, otherwise 

Let P„be a matrix of zeroes and ones such that the nonzero 

elements y^ot /?M satisfy 

25 1 ~ ' ~ ~ 

3. Perform the E step by calculating 
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QW\/}< n) ) = £{log(p(£, ? ,r| £ ))| /,£<">} 

where Lis the log likelihood function of jf given by 
Equation (8)- Using p = P n y and = Z^y 00 we have 

5 Q(y\r (n) ) =mPnr)-\r T Kd(y in) ))Y 

4. Do the M step* This can be done with Newton Raphson 
iterations as follows. Set y 0 =y^and for r=0 # l,2 # _ 
y r+1 a: y r + a r c£ r where a r is chosen by a line search 
algorithm to ensure Q\y„ x \ jfW*')><2(r, I £ W ,<'J • 

10 For p £ N use 

where 7 = ZP n A [d(y} n ^ ^ 
For p > N use 

Let y* be the value of y r when some convergence 
15 criterion is satisfied e.g J | y r - y r +i| j < e (for example 
10 s ) . 

5. Define p* =P n y , S n -\i:\fii\ >£imax\/3;\\ where £\ is a 



p*=P n y, S n =\i:\Pi\ >eimax\Pj^ 



small constant/ say 1<T 5 . This step eliminates variables 
with very small coefficients. 

20 6. Check convergence. If \\y* -jX") \\< e 2 where e 2 is suitably 

small then stop, else set n=n+l, go to step 2 above and 
repeat procedure until convergence occurs. 
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In another embodiment: the EM algorithm is applied to 
maximise the posterior distribution when the model is a 
parametric survival model. 



10 



15 



In applying the EM algorithm to the parametic survival 
model, a consequence of equation (11) is that the c±' s may- 
be treated as Poisson variates with means fi { ± and that the 
last term in equation (11) does not depend on fi (although 
it depends on <p) . Note that /og(/i / ) = /og^yl^y / ;^JJ+ X t p and so 
it is possible to couch the problem in terms of a log-linear 
model for the Poisson-like mean. Preferably, an iterative 
maximization of the log- likelihood function is performed 
where given initial estimates of q> the estimates of p are 
obtained* Then given these estimates of fi , updated 
estimates of <p are obtained. The procedure is continued 
until convergence occurs. 

Applying the posterior distribution described above, we 
note that (for fixed q>) 



dp M dp dp 



dp 



dp 



(D14) 



20 Consequently from Equations (11) and (12) it follows that 



dL 



= X T {c~fA and Z~±.~-X T a(/j)x 
dP V ~ C} dp 2 u/ 



The versions of Equation (12) relevant to the parametric 
survival models are 



25 



dy r " dp, " ^ V 



dL 



d 2 L T d 2 L 

dy) " dp; 



(D15) 



To solve for q> after each M step of the EM algorithm (see 
step 5 below) preferably put <p^ n ^ = qfc 1 * 1 } +K n where <p 
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satisfies — = 0 for 0<K n £l and p is fixed at the value 

dq> 

obtained from the previous M step. 

It is possible to provide an EM algorithm for parameter 
selection in the context of parametric survival models and 
5 microarray data* Preferably , the EM algorithm is as 
follows : 

1. Choose a hyper prior and its parameters b and k eg b-le7 
and k=0.5. Set n=0, S 0 * p} qfc mtial ) = ^(°) . Let v be 

10 the vector with components 




if cH 
if c^Q 



for some small e , say for example .001. Define f to be 
15 log(v/A(y,<p)). 

If p £N compute initial values /?* by /?* =(X T X + AJ)' 1 X T f ♦ 
If p > W compute initial values p by 

yf .x*(XX T +XI) A X)X T f 
20 where the ridge parameter A satisfies 0 < X <> 1. 



2 . Define 

' 1 0, otherwise 
Let i^be a matrix of zeroes and ones such that the nonzero 



25 elements y^of p( n ) satisfy 



r = PIP 
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3 . Perform the E step by calculating 

5 

where LLa the log likelihood function of y and q£ n ) • 
Using 0 = P n y and /? (n) - we have 

4. Do the M step. This can be done with Newton Raphson 
10 iterations as follows* Set y Q =y^and for r«0,l,2,.« 

Yr+\ = Yr + a r €r where a r is chosen by* a line search 
algorithm to ensure Q{y r+l \ y W ,p w ) >Q(y t I Y { *\<p la) ) . 
For p S N use 

where r = Ay„A(j(r w )). 
15 For p > N use 

Let y* be the value of y r when some convergence 
criterion is satisfied e.g | f Yr - Yr+il I < e (for example 

10~ 5 ) • 

20 5, Define ft* ~P n y* , 5„=|i;|/? / | > S\ max\f}j || where e\ is a small 
constant, say 1CT 5 . Set xi=:ri+l, choose q>^ n ^ =qk n } 4*/c w ^* — 
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dL(y\P„ r \q>) 

where <p satisfies — ^ = — ^- = 0 and K n is a damping 

d<p 

factor such that 0<K n <l. 

6. Check convergence. If -y} n} l|< *2 where s 2 is suitably 

5 small then stop, else go to step 2. 

In another embodiment , survival times are described by a 
Weibull survival density function. For the Weibull case q> 
ia preferably one dimensional and 

10 

A(y;<p) = y a , 
<p~a 

XT N 

Preferably, — =— +y(q (>>/) = 0 ia solved 

da a ^ 

after each M step so as to provide an updated value of a . 

15 Following the steps applied for Cox's proportional hazards 
model/ one may estimate a and select a parsimonious subset 
of parameters from p that can provide an adequate 
explanation for the survival times if the survival times 
follow a Weibull distribution. A numerical example is now 

2 0 given . 

The preferred embodiment of the present invention will now 
be described by way of reference only to the following non- 
limiting example. It should be understood, however, that 
25 the example is illustrative only, should not be taken in any 
way as a restriction on the generality of the invention 
described above. 



Example: 
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Pull normal regression example 201 data points 41 basis 
functions 

k=0 and b=le7 

5 the correct four basis functions are identified namely 
2 12 24 34 

estimated variance is 0.67. 

With k=0.2 and b=le7 
10 eight basis functions are identified, namely 
2 8 12 16 19 24 34 

estimated variance is 0 . 63 . Note that the correct set of 
basis functions is included in this set. 

15 The results of the iterations for k~0.2 and b=le7 are given 
below. 



Iteration: 0 expected post: 2 basis fns 41 



20 sigma squared 0.6004567 

EM Iteration: 1 expected post: -63.91024 basis fns 41 



25 



sigma squared 0.6037467 

EM Iteration: 2 expected post? 

sigma squared 0.6081233 

EM Iteration: 3 expected post! 



-52.76575 basis fns 41 



•53.10084 basis fns 30 



sigma squared 0.6118665 
30 EM Iteration: 4 expected post: -53.55141 basis fns 22 



sigma squared 0.6143482 

EM Iteration: 5 expected post: 



53.79887 basis fns 18 



35 sigma squared 0.6155 

EM Iteration: 6 expected post: 



-53.91096 basis fns 18 
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sigma squared 0.6159484 

EM Iteration: 7 expected post: -53.94735 basis fns 16 

sigma squared 0.6160951 
5 EM Iteration: 8 expected post: -53.92469 basis fns 14 

sigma squared 0*615873 

EM Iteration: 9 expected post: -53.83668 basis fns 13 

10 sigma squared 0.6156233 

EM Iteration: 10 expected post: -53.71836 basis fns 13 

sigma squared 0.6156616 

EM Iteration: 11 expected post: -53.61035 basis fns 12 

15 

sigma squared 0.6157966 

EM Iteration: 12 expected post: -53.52386 basis fns 12 

sigma squared 0.6159524 
20 EM Iteration: 13 expected post: -53.47354 basis fns 12 

sigma squared 0.6163736 

EM Iteration: 14 expected post: -53*47986 basis fns 12 

25 sigma squared 0.6171314 

EM Iteration: 15 expected post: -53.53784 basis fns 11 

sigma squared 0.6182353 

EM Iteration: 16 expected post: -53.63423 basis fns 11 

30 

sigma squared 0.6196385 

EM Iteration: 17 expected post: -53.75112 basis fns 11 

sigma squared 0.621111 
35 EM Iteration: 18 expected post: -53.86309 basis fns 11 



sigma squared 0.6224584 
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EM Iteration: 19 expected post: -53.96314 basis fns 11 
sigma squared 0,6236203 

EM Iteration: 20 expected post: -54.05662 basis fns 11 

5 

sigma squared 0.6245656 

EM Iteration: 21 expected post: -54.13 82 basis fns 10 

sigma squared 0.6254182 
10 EM Iteration: 22 expected post: -54.21169 basis fns 10 

sigma squared 0.6259266 

EM Iteration: 23 expected post: -54.25395 basis fns 9 

15 sigma squared 0.62 59266 

EM Iteration: 24 expected post: -54.26136 basis fns 9 

sigma squared 0.626023 8 

EM Iteration: 25 expected post: -54.25962 basis fns 9 

20 

sigma squared 0.6260203 

EM Iteration: 26 expected post: -54.25875 basis fns 8 

sigma squared 0.6260179 
25 EM Iteration: 27 expected post: -54.25836 basis fns 8 

sigma squared 0.626017 

EM Iteration: 28 expected post: -54.2582 basis fns 8 

30 sigma squared 0.6260166 

For the reduced data set with 201 observations and 10 
variables, k=0 and b=le7 

Gives the correct basis functions, namely 12 3 4. With 
35 k=0.25 and b=le7, 7 basis functions were chosen, namely 1 2 
3 4 6 8 9. A record of the iterations is given below. 
Note that this set also includes the correct set. 
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EM Iteration: 0 expected post: 2 basis fns 10 

sigma squared 0.6511711 
5 EM Iteration: 1 expected post: -66.18153 basis fns 10 

sigma squared 0.65162 89 

EM Iteration: 2 expected post: -57.69118 basis fns 10 

10 sigma squared 0.6518373 

EM Iteration: 3 expected post: -57.72295 basis fns 9 



15 



sigma squared 0.6518373 

EM Iteration: 4 expected post: 

sigma squared 0.65188 

Iteration: 5 expected post: 



-57.74616 basis fns 8 



-57.75293 basis fns 7 



sigma squared 0.6518781 

2 0 Ordered categories examples 

Luo prostate data 15 samples 9605 genes. For k=0 and b=le7 
we get the following results 

misclassif ication table 
25 pred 

y 12 3 4 

1 4 0 0 0 

2 0 2 1 0 

3 0 0 4 0 
30 4 0 0 0 4 

Identifiers of variables left in ordered categories model 
6 611 



35 For k=0.25 and b=le7 we get the following results 



misclassif ication table 
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pred 
y 1 2 3 4 

1 4 0 0 0 

2 0 3 0 0 
5 3 0 0 4 0 

4 0 0 0 4 

Identifiers of variables left in ordered categories model 
6611 7188 

10 

Note that we now have perfect discrimination on the training 
data with the addition of one extra variable. A record of 
the iterations of the algorithm is given below* 

15 ******************************************* 

Iteration 1 : 11 cycles, criterion -4.661957 

misclassif ication matrix 
fhat 
20 f 12 

1 23 0 

2 0 22 

row =true class 

25 Class 1 Number of basis functions in model : 9608 
*********************************************** 

Iteration 2 : 5 cycles, criterion -9.536942 

misclassif ication matrix 
30 fhat 

f 12 

1 23 0 

2 1 21 

row =true class 

35 

Class 1 Number of basis functions in model : 6431 
*********************************************** 
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Iteration 3 * 4 cycles, criterion -9.007843 

misclassif ication matrix 
fhat 
5 f 12 

1 23 0 

2 0 22 

row =true class 

10 Class 1 Number of basis functions in model : 50 8 
*********************************************** 

Iteration 4 i 5 cycles, criterion -6.47555 

misclassif ication matrix 
15 fhat 

£ 12 

1 23 0 

2 0 22 

row -true class 

20 

Class 1 Number of basis functions in model : 62 
*********************************************** 

Iteration 5:6 cycles, criterion -4.126996 

25 misclassif ication matrix 
fhat 
f 12 

1 23 0 

2 1 21 

30 row =true class 

Class 1 Number of basis functions in model : 20 
********************************* 

Iteration 6 ; 6 cycles, criterion -3.047699 

35 



misclassif ication matrix 
fhat 
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f 12 

1 23 0 

2 1 21 

row =true class 

5 

Class 1 Number of basis functions in model : 12 
**************************************** 

Iteration 7 : 5 cycles, criterion -2 * 610974 

10 misclassif ication matrix 
fhat 
f 12 

1 23 0 

2 1 21 

15 row =true class 

Class 1 s Variables left in model 
12 3 408 846 6614 7191 8077 
regression coefficients 
20 28.81413 14.27784 7.025863 -1. 086501e-06- 4.553004e-09 - 
16.25844 0.1412991 -0.04101412 

*********************************************** 

Iteration 8 * 5 cycles, criterion -2.307441 

25 

misclassif ication matrix 

fhat 
f 1 2 
1 23 0 
30 2 1 21 

row strue class 

Class 1 * Variables left in model 
12 3 6614 7191 8077 
35 regression coefficients 

32.66699 15.80614 7.86011 -18.53527 0.1808061 -0.006728619 
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*********************************************** 

Iteration 9 t 5 cycles, criterion -2.028043 

misclassif ication matrix 
5 fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

10 

Class 1 t Variables left in model 
12 3 6614 7191 8077 
regression coefficients 

36-11990 17.21351 8.599812 -20.52450 0.2232955 -0.0001630341 

15 

*********************************************** 
Iteration 10 : 6 cycles, criterion -1.808861 

misclassif ication matrix 
20 fhat 

f 12 

1 23 0 

2 0 22 

row =true class 

25 

Class 1 : Variables left in model 
12 3 6614 7191 8077 
regression coefficients 

39.29053 18.55341 9.292612 -22.33653 0.260273 -8 . 696388e-08 

30 

*********************************************** 
Iteration 11 : 6 cycles, criterion -1.656129 

misclassif ication matrix 
35 fhat 

f 12 
1 23 0 
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2 0 22 
row =true class 



Class 1 : Variables left in model 
5 12 3 6614 7191 

regression coefficients 

42.01569 19.73626 9*90312 -23.89147 0.2882204 

*********************************************** 

10 Iteration 12 : 6 cycles, criterion -1.554494 

misclassification matrix 

fhat 
f 12 
15 1 23 0 

2 0 22 
row =true class 

Class 1 : Variables left in model 
20 12 3 6614 7191 

regression coefficients 

44*19405 20.69926 10.40117 -25.1328 0.3077712 
*********************************************** 
Iteration 13 : 6 cycles, criterion -1.487778 

25 

misclassification matrix 

fhat 
f 12 
1 23 0 
30 2 0 22 

row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
35 regression coefficients 

45.84032 21.43537 10.78268 -26.07003 0.3209974 
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Iteration 14 . 6 cycles, criterion -1*443949 

misclassification matrix 
5 fhat 
f 12 

1 23 0 

2 0 22 

row «true class 

10 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

47-03702 21.97416 11*06231 -26*75088 0.3298526 

15 

Iteration 15 : 6 cycles, criterion -1.415 

misclassification matrix 
20 fhat 

f 12 

1 23 0 

2 0 22 

row =true class 

25 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

47.88472 22.35743 11.26136 -27.23297 0.3357765 

30 

******************************^ 

Iteration 16 : 6 cycles, criterion -1.395770 

misclassification matrix 
35 fhat 

f 12 
1 23 0 
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2 0 22 
row =true class 

Class 1 : Variables left in model 
5 1 2 3 6614 7191 

regression coefficients 

48*47516 22.62508 11.40040 -27.56866 0.3397475 

****************************************** 

10 Iteration 17 : 5 cycles, criterion -1.382936 

misclassif ication matrix 

f hat 
f 12 
15 1 23 0 

2 0 22 
row =true class 

Class 1 : Variables left in model 
20 1 2 3 6614 7191 

regression coefficients 

48.88196 22.80978 11.49636 -27.79991 0.3424153 

*********************************************** 
25 Iteration 18 : 5 cycles, criterion -1.374340 

misclassif ication matrix 

fhat 
f 12 
30 1 23 0 

2 0 22 
row =true class 

I 

Class 1 : Variables left in model 
35 1 2 3 6614 7191 

regression coefficients 

49.16029 22.93629 11.56209 -27.95811 0.3442109 
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*********************************************** 

Iteration 19 : 5 cycles, criterion -1.368567 

5 misclassification matrix 
fhat 
f 12 

1 23 0 

2 0 22 

10 row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 
15 49.34987 23.02251 11.60689 -28.06586 0.3454208 

*********************************************** 

Iteration 20 s 5 cycles, criterion -1.364684 

2 0 misclassification matrix 
fhat 
f 12 

1 23 0 

2 0 22 

25 row =true class 

Class 1 s Variables left in model 
12 3 6614 7191 
regression coefficients 
30 49.47861 23.08109 11.63732 -28.13903 0.3462368 

*********************************************** 
Iteration 21 : 5 cycles, criterion -1.362068 

35 misclassification matrix 
fhat 
f 12 
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1 23 0 

2 0 22 

row ctrue class 



5 Class 1 s Variables left in model 
12 3 6614 7191 
regression coefficients 

49*56588 23.12080 11.65796 -28.18862 0.3467873 

10 *********************************************** 
Iteration 22 * 5 cycles, criterion -1.360305 

misclassification matrix 
fhat 
15 f 12 

1 23 0 

2 0 22 

row =true class 

2 0 Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49-62496 23.14769 11.67193 -28.22219 0.3471588 

25 *********************************************** 

Iteration 23 t 4 cycles, criterion -1.359116 

misclassification matrix 
fhat 
30 f 12 

1 23 0 

2 0 22 

row -true class 

35 Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 
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49.6649 23-16588 11.68137 -28.2449 0.3474096 

*********************************************** 

Iteration 24 : 4 cycles , criterion -1.3 58314 

5 

misclassification matrix 

fhat 
f 12 
1 23 0 
10 2 0 22 

row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
15 regression coefficients 

49.69192 23.17818 11.68776 -28.26025 0.3475789 

*********************************************** 
Iteration 25 : 4 cycles, criterion -1.357772 

20 

misclassification matrix 

fhat 
f 12 
1 23 0 
25 2 0 22 

row strue class 

Class 1 : Variables left in model 
12 3 6614 7191 
30 regression coefficients 

49.71017 23.18649 11.69208 -28.27062 0.3476932 

*********************************************** 

Iteration 26 r 4 cycles, criterion -1.357407 

35 

misclassification matrix 
fhat 
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f 1 2 

1 23 0 

2 0 22 

row = true class 

5 

Class 1 x Variables left In model 
12 3 6614 7191 
regression coefficients 

49.72251 23.19211 11.695 -28.27763 0.3477704 

10 

*********************************************** 
Iteration 27 : 4 cycles, criterion -1.35716 

misclassification matrix 
15 fhat 

f 12 

1 23 0 

2 0 22 

row =true class 

20 

Class 1 t Variables left in model 
12 3 6614 7191 
regression coefficients 

49.73084 23.19590 11.69697 -28.28237 0.3478225 

25 

*********************************************** 

Iteration 28 s 3 cycles, criterion -1.356993 

misclassification matrix 
30 fhat 

f 12 

1 23 0 

2 0 22 

row = true class 

35 

Class 1 : Variables left in model 
12 3 6614 7191 
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regression coefficients 

49-73646 23.19846 11.6983 -28.28556 0.3478577 

*********************************************** 
5 Iteration 29 : 3 cycles, criterion -1.356881 

misclassif ication matrix 

fhat 
f 12 
10 1 23 0 

2 0 22 
row atrue class 

Class 1 : Variables left in model 
15 12 3 6614 7191 

regression coefficients 

49.74026 23.20019 11.6992 -28.28772 0.3478814 

*********************************************** 

20 Iteration 30 : 3 cycles, criterion -1.356805 

misclassif ication matrix 

fhat 
f 12 
25 1 23 0 

2 0 22 
row =true class 

Class 1 : Variables left in model 
30 12 3 6614 7191 

regression coefficients 

49.74283 23.20136 11.69981 -28.28918 0.3478975 
1 

35 



misclassif ication table 
pred 
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y 12 3 4 

1 4 0 0 0 

2 0 3 0 0 

3 0 0 4 0 
5 4 0 0 0 4 

Identifiers of variables left in ordered categories model 
6611 7188 

10 Ordered categories example 

Luo prostate data 15 samples 50 genes. For k=0 and b=le7 we 
get the following results 

misclassif ication table 
15 pred 

y 12 3 4 

1 4 0 0 0 

2 0 2 1 0 

3 0 0 4 0 
20 4 0 0 0 4 

Identifiers of variables left in ordered categories model 
1 

25 For k=0.25 and b=le7 we get the following results 

misclassif ication table 

pred 
y 12 3 4 
30 14 0 0 0 

2 0 3 0 0 

3 0 0 4 0 

4 0 0 0 4 

35 Identifiers of variables left in ordered categories model 
1 42 
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A record of the iterations for k=0.25 and b=le7 is given 
below 

*********************************************** 

5 Iteration 1 : 19 cycles, criterion -0.4708706 

misclassification matrix 

fhat 
f 12 
10 1 23 0 

2 0 22 
row =true class 

Class 1 Number of basis functions in model : 53 
15 *********************************************** 

Iteration 2 t 7 cycles, criterion -1. 536822 

misclassification matrix 
fhat 
20 f 12 

1 23 0 

2 0 22 

row =true class 

25 Class 1 Number of basis functions in model : 53 
*********************************************** 

Iteration 3 : 5 cycles, criterion -2.032919 

misclassification matrix 
30 fhat 

f 12 

1 23 0 

2 0 22 

row =true class 

35 

Class 1 Number of basis functions in model : 42 
*********************************************** 
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Iteration 4 : 5 cycles, criterion -2.132546 

misclassif ication matrix 
fhat 
5 f 12 

1 23 0 

2 0 22 

row a true class 

10 Class 1 Number of basis functions in model : 18 
*********************************************** 
Iteration 5 : 5 cycles, criterion -1. 978462 

misclassif ication matrix 
15 fhat 

f 12 

1 23 0 

2 0 22 

row =true class 

20 

Class 1 Number of basis functions in model : 13 
*********************************************** 
Iteration 6 : 5 cycles, criterion -1.66 8212 

25 misclassif ication matrix 
fhat 
f 12 

1 23 0 

2 0 22 

3 0 row =true class 

Class 1 : Variables left in model 
1 2 3 4 10 41 43 45 
regression coefficients 
35 34.13253 22.30781 13.04342 -16.23506 0.003213167 0.006582334 
-0.0005943874 -3.557023 



VVUU^'llKtgQO innp7/VVWW.g ecuie paiefii.coili/L.OHiiKUoy/^banciY.i uuu i/ rciuim yun twr«»u^ /.^K »o' »»'t'^>^'g- , H cu. L-M.ii.n^v.^. 

WO 2004/104856 PCT/AU20O4/0OO696 

- 104 - 

*********************************************** 

Iteration 7 : 5 cycles , criterion -1. 407871 

misclassif ication matrix 
5 fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

10 

Class 1 : Variables left in model 
1 2 3 4 10 41 43 45 
regression coefficients 

36.90726 24.69518 14.61792 -17.16723 1.112172e-05 5.949931e- 
15 06 -3.892181e-08 -4.2906 

*********************************************** 
Iteration 8 : 5 cycles, criterion -1.244166 

20 misclassif ication matrix 
fhat 
f 12 

1 23 0 

2 0 22 

25 row =true class 

Class 1 : Variables left in model 
1 2 3 4 10 45 
regression coefficients 
30 39.15038 26*51011 15.78594 -17.99800 1.125451e-10 -4.799167 

*********************************************** 
Iteration 9 : 5 cycles , criterion -1.147754 



35 



misclassif ication matrix 

fhat 
f 12 
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1 23 0 

2 0 22 

row =true class 



5 Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 

40.72797 27.73318 16.56101 -18.61816 -5.115492 

10 ******************************************** 
Iteration 10 : 5 cycles , criterion -1.09211 

misclassif ication matrix 
fhat 
15 f 12 

1 23 0 

2 0 22 

row -true class 

2 0 Class 1 : Variables left in model 
1 2 3 4 45 

r egr es s ion coe f f i c i en t s 

41.74539 28.49967 17.04204 -19.03293 -5.302421 

25 *********************************************** 

Iteration 11 : 5 cycles, criterion -1.06023 8 

misclassification matrix 
fhat 
30 f 12 

1 23 0 

2 0 22 

row =true class 

35 Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 
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42,36866 28.96076 17-32967 -19.29261 -5.410496 

*********************************************** 

Iteration 12 * 5 cycles, criterion -1.042 037 

5 

misclassif ication matrix 

fhat 
f 12 
1 23 0 
10 2 0 22 

row =strue class 

Class 1 : Variables left in model 
1 2 3 4 45 
15 regression coefficients 

42.73908 29.23176 17.49811 -19.44894 -5.472426 

*********************************************** 
Iteration 13 : 5 cycles, criterion -1.031656 

20 

misclassif ication matrix 

fhat 
f 12 
1 23 0 
25 2 0 22 

row =true class 

Class 1 : Variables left in model 
1 2 3 4 45 
30 regression coefficients 

42.95536 29.38894 17.59560 -19.54090 -5.507787 

*********************************************** 

Iteration 14 t 4 cycles, criterion -1*025738 

35 

misclassif ication matrix 
fhat 
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f 12 

1 23 0 

2 0 22 

row =true class 

5 

Class 1 s Variables left in model 
1 2 3 4 45 

regression coefficients 

43.08034 29.47941 17.65163 -19.59428 -5.527948 

10 

********** ************************************* 
Iteration 15 : 4 cycles, criterion -1.022366 

tnisclassification matrix 
15 fhat 

f 12 

1 23 0 

2 0 22 

row =true class 

20 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 

43.15213 29.53125 17*68372 -19.62502 -5.539438 

25 

*********************************************** 

Iteration 16 : 4 cycles, criterion -1.020444 

misclassification matrix 
30 fhat 

f 12 

1 23 0 

2 0 22 

row =true class 

35 

Class 1 : Variables left in model 
1 2 3 4 45 
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regression coefficients 

43.19322 29.56089 17.70206 -19.64265 -5.545984 

*********************************************** 
5 Iteration 17 : 4 cycles, criterion -1.019349 

misclassification matrix 

fhat 
f 1 2 
10 1 23 0 

2 0 22 
row = true class 

Class 1 : Variables left in model 
15 1 2 3 4 45 

regression coefficients 

43.21670 29-57780 17.71252 -19.65272 -5.549713 

*********************************************** 
20 Iteration 18 : 3 cycles, criterion -1.018725 

misclassification matrix 

fhat 
f 12 
25 1 23 0 

2 0 22 
row =true class 

Class 1 : Variables left in model 
30 1 2.3 4 45 

regression coefficients 

43.23008 29.58745 17.71848 -19.65847 -5.551837 

*********************************************** 
35 Iteration 19 : 3 cycles, criterion -1.01837 



misclassification matrix 
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fhat 
£12 

1 23 0 

2 0 22 

5 row =true class 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 
10 43.23772 29-59295 17.72188 -19.66176.-5.553047 

*********************************************** 

Iteration 20 : 3 cycles , criterion -1.018167 

15 misclassif ication matrix 
fhat 
f 12 

1 23 0 

2 0 22 

2 0 row -true class 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 
25 43.24208 29.59608 17.72382 -19.66363 -5.553737 

*********************************************** 
Iteration 21 : 3 cycles, criterion -1.018052 

3 0 misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

35 row etrue class 

Class 1 : Variables left in model 
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1 2 3 4 45 

regression coefficients 

43.24456 29.59787 17.72493 -19.66469 -5.55413 

5 *********************************************** 

Iteration 22 s 3 cycles, criterion -1.017986 

misclassif ication matrix 
fhat 
10 f 12 

1 23 0 

2 0 22 

row sstrue class 

15 Class 1 s Variables left in model 
1 2 3 4 45 

regression coefficients 

43.24598 29.59889 17.72556 -19.6653 -5.554354 
20 1 

misclassif ication table 

pred 
y 12 3 4 
25 1 4 0 0 0 

2 0 3 0 0 

3 0 0 4 0 

4 0 0 0 4 

Identifiers of variables left in ordered categories model 
30 1 42 
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THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS: 

1. A method of identifying a subset of components of a 
system based on data obtained from the system .using at least 
5 one training sample from the system, the method comprising 
the steps of: 

obtaining a linear combination of components of the 
system and weightings of the linear combination of 
components, the weightings having values based on data 

10 obtained from the at least one training sample, the at least 
one training sample having a known feature; 

obtaining a model of a probability distribution of the 
known feature, wherein the model is conditional on the 
linear combination of components; 

15 obtaining a prior distribution for the weighting of 

the linear combination of the components, the prior 
distribution comprising a hyperprior having a high 
probability density close to zero, the hyperprior being such 
that it is not a Jeffreys hyperprior; 

2 0 combining the prior distribution and the model to 

generate a posterior distribution; and 

identifying the subset of components based on a set of 
the weightings that maximise the posterior distribution. 

25 2. The method as claimed in claim 1, wherein the step of 
obtaining the linear combination comprises the step of using 
a Bayesian statistical method to estimate the weightings. 

3. The method as claimed in claim 1 or 2, further 

30 comprising the step of making an apriori assumption that a 
majority of the components are unlikely to be components 
that will form part of the subset of components. 

4. The method as claimed in any one of the preceding 
35 claims, wherein the hyperprior comprises one or more 

adjustable parameters that enable the prior distribution 
near zero to be varied. 
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5. The method as claimed in amy one of the preceding 
claims / wherein the model comprise a mathematical equation 
in the form of a likelihood function that provides the 
probability distribution based on data obtained from the at 
least one training sample* 



10 



6* The method as claimed in claim 5, wherein the 
likelihood function is based on a previously described model 
for describing some probability distribution. 



15 



7* The method as claimed in any one of the preceding 
claims, wherein the step of obtaining the model comprises 
the step of selecting the model from a group comprising a 
multinomial or binomial logistic regression, generalised 
linear model. Cox's proportional hazards model, accelerated 
failure model and parametric survival model. 



8. The method as claimed in claim 7, wherein the model 
20 based on the multinomial or binomial logistical regression 
is in the form of: 



n 



G-\ 

n 



i 



8=1 



*Tfi g 



Tfik 



25 9. The method as claimed in claim 7, wherein the model 
based on the generalised linear model is in the form of: 



L - log p(y | ft 4>) «£ { yA *?'h c(y„<f>) } 



30 



10. The method as claimed In claim 7, wherein the model 
based on the Cox's proportional hazards model is in the form 
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of: 



N 



exp 



£ *xp(z t fi) 



11. The method as claimed in claim 7, wherein the model 
based on the Parametric Survival model is in the form of : 



N 



r 



log 



*(*) 



10 



12. The method as claimed in any one of the preceding 
claims, wherein the step of identifying the subset of 
components comprises the step of using an iterative 
procedure such that the probability density of the posterior 
distribution is maximised. 



15 



20 



25 



13. The method as claimed in claim 12, wherein the 
iterative procedure is an EM algorithm. 

14. A method for identifying a subset of components of a 
subject which are capable of classifying the subject into 
one of a plurality of predefined groups, wherein each group 
is defined by a response to a test treatment, the method 
comprising the steps of: 

exposing a plurality of subjects to the test treatment 

and grouping the subjects into response groups based on 

responses to the treatment; 

measuring components of the subjects; and 
identifying a subset of components that is capable of 

classifying the subjects into response groups using a 

statistical analysis method. 
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15. The method as claimed in claim 14 , wherein the 
statistical analysis method comprises the method as claimed 
in any one of claims 1 to 13. 

5 16. An apparatus for identifying a subset of components of 
a subject, the subset being capable of being used to 
classify the subject into one of a plurality of predefined 
response groups wherein each response group, is formed by 
exposing a plurality of subjects to a test treatment and 
10 grouping the. subjects into response groups based on the 
response to the treatment, the apparatus comprising: 

an input for receiving measured components of the 
subjects; and 

processing means operable to identify a subset of 
15 components that is capable of being used to classify the 

subjects into response groups using a statistical analysis 
method . 

17. The apparatus as claimed in claim 16, wherein the 

2 0 statistical analysis method comprises the method as claimed 

in any one of claims 1 to 15. 

18. A method for identifying a subset of components of a 
subject that is capable of classifying the subject as being 

25 responsive or non-responsive to treatment with a test 
compound, the method comprising the steps of: 

exposing a plurality of subjects to the test compound 
and grouping the subjects into response groups based on each 
subjects response to the test compound; 

3 0 measuring components of the subjects; and 

identifying a subset of components that is capable of 
being used to classify the subjects into response groups 
using a statistical analysis method. 

35 19. The method as claimed in claim 18, wherein the 

statistical analysis method comprises the method as claimed 
in any one of claims 1 to 13. 
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20. An apparatus for identifying a subset of components of 
a subject , the subset being capable of being used to 
classify the subject into one of a plurality of predefined 
5 response groups wherein each response group is formed by 

exposing a plurality of subjects to a compound and grouping 
the s ub jects into response groups based on the response to 
the compound, the apparatus comprising; 

an input operable to receive measured components of 
10 the subjects; 

processing means operable to identify a subset of 
components that is capable of classifying the subjects into 
response groups using a statistical analysis method. 

15 21. The apparatus as claimed in claim 20, wherein the 

statistical analysis method comprises the method as claimed 
in any one of claims 1 to 15 . 

22. An apparatus for identifying a subset of components of 

20 a system from data generated from the system from a 

plurality of samples from the system, the subset being 
capable of being used to predict a feature of a test sample, 
the apparatus comprising; 

a processing means operable to: 

25 obtain a linear combination of components of the system 

and obtain weightings of the linear combination of 
components, each of the weightings having a value based on 
data obtained from at least one training sample, the at 
least one training sample having a known feature; 

30 obtaining a model of a probability distribution of a 

second feature, wherein the model is conditional on the 
linear combination of components; 

obtaining a prior distribution for the weightings of 
the linear combination of the components, the prior 

35 distribution comprising an adjustable hyperprior which 

allows the prior probability mass close to zero to be varied 
wherein the hyperprior is not a Jeffrey's hyperprior; 
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combining the prior distribution and the model to 
generate a posterior distribution* and 

identifying the subset of components having component 
weights that maximize the posterior distribution* 

5 

23. The apparatus as claimed in claim 22 , wherein the 
processing means comprises a computer arranged to execute 
software* 

10 24. A computer program which, when executed by a computing 
apparatus, allows the computing apparatus to carry out the 
method as claimed in any one of claims 1 to 13 . 

25. A computer readable medium comprising the computer 
15 program as claimed in claim 24. 

26. A method of testing a sample from a system to identify 
a feature of the sample, the method comprising the steps of 
testing for a subset of components that are diagnostic of 

20 the feature, the subset of components having been determined 
by using the method as claimed in any one of claims 1 to 15. 

27. The method as claimed in claim 26, wherein the system 
is a biological system. 

25 

28. An apparatus for testing a sample from a system to 
determine a feature of the sample, the apparatus comprising 
means for testing for components identified in accordance 
with the method as claimed in any one of claims 1 to 15. 

30 

29. A computer program which, when executed by on a 
computing device, allows the computing device to carry out a 
method of identifying components from a system that are 
capable of being used to predict a feature of a test sample 

35 from the system, and wherein a linear combination of 

components and component weights is generated from data 
generated from a plurality of training samples, each 
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training sample having a known feature, and a posterior 
distribution is generated by combining a prior distribution 
for the component weights comprising an adjustable 
hyperprior which allows the probability mass close to zero 
5 to be varied wherein the hyperprior is not a Jeffrey's 

hyperprior, and a model that is conditional on the linear 
combination, to estimate component weights which maximise 
the posterior distribution. 

10 30. A method of identifying a subset of components of a 

biological system, the subset being capable of predicting a 
feature of a test sample from the biological system, the 
method comprising the steps of: 

obtaining a linear combination of components of the 
15 system and weightings of the linear combination of 

components, each of the weightings having a value based on 
data obtained from at least one training sample, the at 
least one training sample having a known feature; 

obtaining a model of a probability distribution of the 
20 known feature, wherein the model is conditional on the 
linear combination of components; 

obtaining a prior distribution for the weightings of 
the linear combination of the components, the prior 
distribution comprising an adjustable hyperprior which 
25 allows the probability mass close to zero to be varied; 

combining the prior distribution and the model to 
generate a posterior distribution; and 

identifying the subset of components based on the 
weightings that maximize the posterior distribution. 

30 

DATED this 26 th day of May 2004 
CSIRO 

By their Patent Attorneys 
GRIFFITH HACK 
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